IBM power PS700, PS704, PS703 Problem Determination And Service Manual

Power Systems
Problem Determination and Service Guide for the IBM Power PS700 (8406-70Y)

GI11-9831-00
Note
Before using this information and the product it supports, read the information in “Notices,” on page 271, “Safety notices” on page v, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823.
This edition applies to IBM Power Systems servers that contain the POWER7 processor and to all associated models.
© Copyright IBM Corporation 2010, 2011.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Safety notices ............v
Chapter 1. Introduction ........1
Related documentation ...........1
Notices and statements ...........2
Features and specifications..........2
Supported DIMMs ............4
Blade server control panel buttons and LEDs . . . 5
Turning on the blade server .........6
Turning off the blade server .........7
System-board layouts ...........8
System-board connectors .........8
System-board LEDs ...........9
Chapter 2. Diagnostics ........11
Diagnostic tools .............11
Collecting dump data ...........13
Location codes .............14
Reference codes .............15
System reference codes (SRCs) .......16
1xxxyyyy SRCs ...........17
6xxxyyyy SRCs ...........21
A1xxyyyy service processor SRCs .....24
AA00E1A8 to AA260005 Partition firmware
attention codes ...........25
Bxxxxxxx Service processor early termination
SRCs ..............28
B200xxxx Logical partition SRCs .....29
B700xxxx Licensed internal code SRCs . . . 39
BA000010 to BA400002 Partition firmware
SRCs ..............48
POST progress codes (checkpoints) .....84
C1001F00 to C1645300 Service processor
checkpoints ............85
C2001000 to C20082FF Virtual service
processor checkpoints .........93
IPL status progress codes .......102
C700xxxx Server firmware IPL status
checkpoints ...........102
CA000000 to CA2799FF Partition firmware
checkpoints ............102
D1001xxx to D1xx3FFF Service processor
dump codes ............120
D1xx3y01 to D1xx3yF2 Service processor
dump codes ...........125
D1xx900C to D1xxC003 Service processor
power-off checkpoints ........128
Service request numbers (SRNs) ......129
Using the SRN tables .........129
101-711 through FFC-725 SRNs .....129
A00-FF0 through A24-xxx SRNs .....157
SCSD Devices SRNs (ssss-102 to ssss-640) 177
Failing function codes 151 through 2E33 . . 181
Error logs ..............183
Checkout procedure ...........184
About the checkout procedure.......184
Performing the checkout procedure .....184
Verifying the partition configuration......186
Running the diagnostics program ......186
Starting AIX concurrent diagnostics .....186
Starting stand-alone diagnostics from a CD . . 187
Starting stand-alone diagnostics from a NIM
server ...............188
Using the diagnostics program ......189
Boot problem resolution..........190
Troubleshooting tables ..........191
General problems ...........191
Drive problems............192
Intermittent problems .........192
Management module service processor
problems ..............193
Memory problems ...........193
Microprocessor problems ........194
Network connection problems.......194
PCI expansion card (PIOCARD) problem
isolation procedure ..........194
Optional device problems ........196
Power problems ...........196
POWER Hypervisor (PHYP) problems ....198
Service processor problems........200
Software problems...........213
Universal Serial Bus (USB) port problems . . . 213
Light path diagnostics ..........214
Viewing the light path diagnostic LEDs . . . 214
Light path diagnostics LEDs .......215
Isolating firmware problems ........218
Save vfchost map data ..........218
Restore vfchost map data .........219
Recovering the system firmware .......220
Starting the PERM image ........220
Starting the TEMP image ........221
Recovering the TEMP image from the PERM
image ...............221
Verifying the system firmware levels ....221
Committing the TEMP system firmware image 222
Solving shared BladeCenter resource problems . . 222
Solving shared media tray problems.....223
Solving shared network connection problems 225
Solving shared power problems ......225
Solving shared video problems ......226
Solving undetermined problems .......227
Calling IBM for service ..........228
Chapter 3. Parts listing, Type 8406 229
Chapter 4. Removing and replacing
blade server components ......233
Installation guidelines ..........233
System reliability guidelines .......234
Handling static-sensitive devices ......234
Returning a device or component .....234
Removing the blade server from a BladeCenter
unit ................235
Installing the blade server in a BladeCenter unit 236
Removing and replacing Tier 1 CRUs .....237
Removing the blade server cover ......237
Installing and closing the blade server cover . . 239
Removing the bezel assembly .......240
Installing the bezel assembly .......240
Removing a drive ...........241
Installing a drive ...........242
Removing a memory module .......244
Installing a memory module .......245
Removing and installing an I/O expansion card 246
Removing a CIOv form-factor expansion card 247
Installing a CIOv form-factor expansion card 247
Removing a combination-form-factor
expansion card ...........249
Installing a combination-form-factor
expansion card ...........249
Removing the battery .........250
Installing the battery ..........251
Removing the disk drive tray .......252
Installing the disk drive tray .......253
Removing the tier 2 management card .....255
Installing the tier 2 management card .....256
Obtaining a PowerVM Virtualization Engine
system technologies activation code ......257
Replacing the FRU system-board and chassis
assembly ...............260
Chapter 5. Configuring .......263
Updating the firmware ..........263
Configuring the blade server ........264
Using the SMS utility...........265
Starting the SMS utility .........265
SMS utility menu choices ........265
Creating a CE login ...........265
Configuring the Gigabit Ethernet controllers . . . 266
Blade server Ethernet controller enumeration . . . 267
MAC addresses for host Ethernet adapters . . . 267
Configuring a RAID array .........268
Updating IBM Director ..........268
Appendix. Notices .........271
Trademarks ..............272
Electronic emission notices .........273
Class A Notices............273
Class B Notices............277
Terms and conditions...........280

Safety notices

Safety notices may be printed throughout this guide:
v DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
v CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.
v Attention notices call attention to the possibility of damage to a program, device, system, or data.
World Trade safety information
Several countries require the safety information contained in product publications to be presented in their national languages. If this requirement applies to your country, a safety information booklet is included in the publications package shipped with the product. The booklet contains the safety information in your national language with references to the U.S. English source. Before using a U.S. English publication to install, operate, or service this product, you must first become familiar with the related safety information in the booklet. You should also refer to the booklet any time you do not clearly understand any safety information in the U.S. English publications.
German safety information
Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der Bildschirmarbeitsverordnung geeignet.
Laser safety information
IBM® servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.
Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.
DANGER
When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:
v Connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.
v Do not open or service any power supply assembly.
v Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this product during an electrical storm.
v The product might be equipped with multiple power cords. To remove all hazardous voltages, disconnect all power cords.
v Connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system rating plate.
v Connect any equipment that will be attached to this product to properly wired outlets.
v When possible, use one hand only to connect or disconnect signal cables.
v Never turn on any equipment when there is evidence of fire, water, or structural damage.
v Disconnect the attached power cords, telecommunications systems, networks, and modems before you open the device covers, unless instructed otherwise in the installation and configuration procedures.
v Connect and disconnect cables as described in the following procedures when installing, moving, or opening covers on this product or attached devices.
To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. Remove the power cords from the outlets.
3. Remove the signal cables from the connectors.
4. Remove all cables from the devices.
To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. Attach the power cords to the outlets.
5. Turn on the devices.
(D005)
DANGER
Observe the following precautions when working on or around your IT rack system:
v Heavy equipment: personal injury or equipment damage might result if mishandled.
v Always lower the leveling pads on the rack cabinet.
v Always install stabilizer brackets on the rack cabinet.
v To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices in the bottom of the rack cabinet. Always install servers and optional devices starting from the bottom of the rack cabinet.
v Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices.
v Each rack cabinet might have more than one power cord. Be sure to disconnect all power cords in the rack cabinet when directed to disconnect power during servicing.
v Connect all devices installed in a rack cabinet to power devices installed in the same rack cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device installed in a different rack cabinet.
v An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock.
CAUTION
v Do not install a unit in a rack where the internal rack ambient temperatures will exceed the manufacturer's recommended ambient temperature for all your rack-mounted devices.
v Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
v Consideration should be given to the connection of the equipment to the supply circuit so that overloading of the circuits does not compromise the supply wiring or overcurrent protection. To provide the correct power connection to a rack, refer to the rating labels located on the equipment in the rack to determine the total power requirement of the supply circuit.
v (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets are not attached to the rack. Do not pull out more than one drawer at a time. The rack might become unstable if you pull out more than one drawer at a time.
v (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless specified by the manufacturer. Attempting to move the drawer partially or completely out of the rack might cause the rack to become unstable or cause the drawer to fall out of the rack.
(R001)
CAUTION: Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building:
v Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you received it. If this configuration is not known, you must observe the following precautions:
– Remove all devices in the 32U position and above.
– Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
– Ensure that there are no empty U-levels between devices installed in the rack cabinet below the 32U level.
v If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from the suite.
v Inspect the route that you plan to take to eliminate potential hazards.
v Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
v Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
v Ensure that all devices, shelves, drawers, doors, and cables are secure.
v Ensure that the four leveling pads are raised to their highest position.
v Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
v Do not use a ramp inclined at more than 10 degrees.
v When the rack cabinet is in the new location, complete the following steps:
– Lower the four leveling pads.
– Install stabilizer brackets on the rack cabinet.
– If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest position to the highest position.
v If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent. Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the pallet.
(R002)
[Safety labels (L001), (L002), and (L003)]
All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser product. Consult the label on each part for laser certification numbers and approval information.
CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:
v Do not remove the covers. Removing the covers of the laser product could result in exposure to hazardous laser radiation. There are no serviceable parts inside the device.
v Use of the controls or adjustments or performance of procedures other than those specified herein might result in hazardous radiation exposure.
(C026)
CAUTION: Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. (C027)
CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)
CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following information: laser radiation when open. Do not stare into the beam, do not view directly with optical instruments, and avoid direct exposure to the beam. (C030)
Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS (Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
v Network telecommunications facilities
v Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.

Chapter 1. Introduction

This problem determination and service information helps you solve problems that might occur in your PS700 blade server. The information describes the diagnostic tools that come with the blade server, error codes and suggested actions, and instructions for replacing failing components.
Replaceable components are of three types:
v Tier 1 customer replaceable unit (CRU): Replacement of Tier 1 CRUs is your responsibility. If IBM installs a Tier 1 CRU at your request, you are charged for the installation.
v Tier 2 customer replaceable unit: You can install a Tier 2 CRU yourself or request IBM to install it, at no additional charge, under the type of warranty service that is designated for your blade server.
v Field replaceable unit (FRU): FRUs must be installed only by trained service technicians.
The serial number for the PS700 blade server can be found in the following locations:
v The bottom front of the blade server in the right corner on the 1S label.
v The bottom rear of the blade server in the right corner.
v Under the front cover door.
For information about the terms of the warranty and getting service and assistance, see the information center or the Warranty and Support Information document on the IBM BladeCenter® Documentation CD.

Related documentation

Documentation for the PS700 blade server includes documents in Portable Document Format (PDF) on the IBM BladeCenter Documentation CD and the online information center.
The most recent version of all BladeCenter documentation is in the online IBM BladeCenter Information Center at http://publib.boulder.ibm.com/infocenter/bladectr/documentation/index.jsp.
PDF versions of the following documents are on the IBM BladeCenter Documentation CD and in the online information center:
v Installation and User's Guide
This document contains general information about the blade server, including how to install supported options and how to configure the blade server.
v Safety Information
This document contains translated caution and danger statements. Each caution and danger statement that appears in the documentation has a number that you can use to locate the corresponding statement in your language in the Safety Information document.
v Warranty and Support Information
This document contains information about the terms of the warranty and about getting service and assistance.
Additional documents might be included in the online information center and on the IBM BladeCenter Documentation CD.
The blade server might have features that are not described in the documentation that comes with the blade server. Occasional updates to the documentation might include information about those features, or technical updates might be available to provide additional information that is not included in the documentation that comes with the blade server.
Review the online information or the Planning Guide and the Installation Guide for your IBM BladeCenter unit. The information can help you prepare for system installation and configuration. The most current version of each document is available in the BladeCenter information center.

Notices and statements

The caution and danger statements in this document are also in the multilingual Safety Information. Each statement is numbered for reference to the corresponding statement in your language in the Safety Information document.
The following notices and statements are used in this document:
v Note: These notices provide important tips, guidance, or advice.
v Important: These notices provide information or advice that might help you avoid inconvenient or problem situations.
v Attention: These notices indicate potential damage to programs, devices, or data. An attention notice is placed just before the instruction or situation in which damage might occur.
v Caution: These statements indicate situations that can be potentially hazardous to you. A caution statement is placed just before the description of a potentially hazardous procedure step or situation.
v Danger: These statements indicate situations that can be potentially lethal or extremely hazardous to you. A danger statement is placed just before the description of a potentially lethal or extremely hazardous procedure step or situation.

Features and specifications

Features and specifications of the IBM BladeCenter PS700 blade server are summarized in this overview.
The PS700 Type 8406 is a single-wide (non-expandable) blade server. The PS700 blade server is used in an IBM BladeCenter H (8852 and 7989), BladeCenter HT (8740 and 8750), or BladeCenter S (8886 and 7779) chassis unit.
Notes:
v Power, cooling, removable-media drives, external ports, and advanced system management are provided by the BladeCenter unit.
v The operating system in the blade server must provide support for the Universal Serial Bus (USB), to enable the blade server to recognize and communicate internally with the removable-media drives and front-panel USB ports.
Core electronics:
v 64-bit POWER7 processors (12S technology)
v Four core, single socket (4-way) processors @ 3.0 GHz
v 64 GB maximum in 8 very low profile (VLP) DIMM slots; supports 4 GB DDR3 at 1066 MHz, and 8 GB DDR3 at 800 MHz
v P5IOC2 I/O hub
On-board, integrated features:
v Two 1 Gb Ethernet ports (HEA) (two on each side)
v SAS controller
v USB 2.0
v 1 Serial over LAN (SOL) console using FSP
FSP1 Service Processor - IPMI and SOL
v The baseboard management controller (BMC) is a flexible service processor (FSP1) with Intelligent Platform Management Interface (IPMI), Serial over LAN (SOL), and Wake on LAN (WOL) firmware support.
Local Storage:
v First DASD bay: zero or one 2.5" SAS HDD
v Second DASD bay: zero or one 2.5" SAS HDD
v SAS HDDs are 300 GB and 600 GB
v Hardware mirroring
Daughter card I/O options:
v 1 1Xe expansion card (CIOv)
v SAS Pass-through using 1Xe
v 1 High-Speed expansion card (CFFh)
Integrated functions:
v RS-485 interface for communication with the management module
v Automatic server restart (ASR)
v SOL through FSP
v Two Universal Serial Bus (USB 2.0) buses on base planar for communication with removable-media drives
v Optical media available by shared chassis feature
Environment:
v Air temperature:
– Blade server on: 10° to 35°C (50° to 95°F). Altitude: 0 to 914 m (0 to 3000 ft)
– Blade server on: 10° to 32°C (50° to 90°F). Altitude: 914 m to 2133 m (3000 ft to 7000 ft)
– Blade server off: -40° to 60°C (-40° to 140°F)
v Humidity:
– Blade server on: 8% to 80%
– Blade server off: 8% to 80%
PS700 Size:
v Height: 24.5 cm (9.7 inches)
v Depth: 44.6 cm (17.6 inches)
v Width: 30 mm (1.14 inches)
Systems management:
v Supported by BladeCenter chassis management module
v Front panel LEDs
v IBM Director
v Hardware Management Console (HMC)
v Integrated Virtualization Manager (IVM)
v Energy Scale thermal management for power management/oversubscription (throttling) and environmental sensing
v Active Energy Manager
Clusters support for:
v IBM Director
v xCAT
Virtualization support for:
PowerVM® Standard Edition hardware feature, which provides the Integrated Virtualization Manager, Virtual I/O Server, and Director Power Systems Manager (DPSM).
Reliability and service features:
v Dual alternating current power supply
v BladeCenter chassis redundant and hot plug power and cooling modules
v Boot-time processor deallocation
v Blade server hot plug
v Customer setup and expansion
v Automatic reboot on power loss
v Internal and ambient temperature monitors
v ECC, chipkill memory
v System management alerts
Electrical input: 12 V dc
See the ServerProven Web site for information about supported operating-system versions and all PS700 blade server optional devices.
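The powered-on limits listed under Environment can be expressed as a small check, which may be useful when comparing sensor readings against the specification. The following is only a sketch: the function name and its parameters are hypothetical, and assigning the 914 m boundary to the lower altitude band is an assumption, because the specification lists 914 m in both ranges.

```python
def within_operating_envelope(temp_c: float, altitude_m: float, humidity_pct: float) -> bool:
    """Return True if a powered-on blade server is inside the listed limits.

    Limits are taken from the Environment entries of the specifications:
      0 to 914 m      -> 10 to 35 degrees C
      914 to 2133 m   -> 10 to 32 degrees C
      humidity        -> 8% to 80%
    """
    if not 8 <= humidity_pct <= 80:
        return False
    if 0 <= altitude_m <= 914:
        # Assumption: exactly 914 m is treated as part of the lower band.
        return 10 <= temp_c <= 35
    if 914 < altitude_m <= 2133:
        return 10 <= temp_c <= 32
    # Altitudes outside the two listed bands are not covered by the specification.
    return False
```

For example, 34°C is acceptable at 900 m but not at 1500 m, where the upper limit drops to 32°C.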

Supported DIMMs

Each planar in the PS700 blade server contains eight very low profile (VLP) memory connectors for registered dual inline memory modules (RDIMMs). The maximum size for a single DIMM is 8 GB. The total memory capacity of the PS700 ranges from a minimum of 4 GB to a maximum of 64 GB.
See Chapter 3, “Parts listing, Type 8406,” on page 229 for memory modules that you can order from IBM.
Memory module rules:
v Install DIMM fillers in unused DIMM slots for proper cooling.
v Install DIMMs in pairs (1 and 3, 6 and 8, 2 and 4, 5 and 7).
v Both DIMMs in a pair must be the same size, speed, type, and technology. You can mix compatible DIMMs from different manufacturers.
v Each DIMM within a processor-support group (1-4 and 5-8) must be the same size and speed.
v Install only supported DIMMs, as described on the ServerProven® Web site. See http://www.ibm.com/servers/eserver/serverproven/compat/us/.
v Installing or removing DIMMs changes the configuration of the blade server. After you install or remove a DIMM, the blade server is automatically re-configured, and the new configuration information is stored.
v See “System-board connectors” on page 8 for DIMM connector locations.
Table 1 shows allowable placement of DIMMs:
Table 1. Memory module combinations
DIMM count   PS700 base blade planar (P1) DIMM slots
             1  2  3  4  5  6  7  8
2            X     X
4            X     X        X     X
6            X  X  X  X     X     X
8            X  X  X  X  X  X  X  X
Figure 1. DIMM connectors. Base unit connectors
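The memory module rules can be checked mechanically before a blade server is populated. The sketch below is illustrative only: `Dimm` and `check_dimm_layout` are hypothetical names, only size and speed are modeled (type and technology are not), and the fill-order check assumes that pairs are populated in the order in which the rules list them.

```python
from dataclasses import dataclass
from typing import Dict, List

# Pair order as listed in the memory module rules.
PAIR_ORDER = [(1, 3), (6, 8), (2, 4), (5, 7)]
# Processor-support groups named in the rules.
GROUPS = [(1, 2, 3, 4), (5, 6, 7, 8)]
MAX_DIMM_GB = 8  # maximum size for a single DIMM


@dataclass(frozen=True)
class Dimm:
    size_gb: int
    speed_mhz: int


def check_dimm_layout(slots: Dict[int, Dimm]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    for slot, dimm in slots.items():
        if slot not in range(1, 9):
            problems.append(f"slot {slot} does not exist")
        if dimm.size_gb > MAX_DIMM_GB:
            problems.append(f"slot {slot} exceeds {MAX_DIMM_GB} GB")

    seen_empty_pair = False
    for a, b in PAIR_ORDER:
        have_a, have_b = a in slots, b in slots
        if have_a != have_b:
            problems.append(f"slots {a} and {b} must be populated as a pair")
        elif have_a:
            if seen_empty_pair:
                # Assumption: pairs are filled in the listed order.
                problems.append(f"pair {a}/{b} is populated out of order")
            if slots[a] != slots[b]:
                problems.append(f"slots {a} and {b} hold mismatched DIMMs")
        else:
            seen_empty_pair = True

    for group in GROUPS:
        kinds = {slots[s] for s in group if s in slots}
        if len(kinds) > 1:
            problems.append(f"slots {group[0]}-{group[-1]} mix DIMM sizes or speeds")
    return problems
```

A layout with matching DIMMs in slots 1 and 3 returns an empty list, while a single DIMM in slot 1 is reported as an incomplete pair.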

Blade server control panel buttons and LEDs

Blade server control panel buttons and LEDs provide operational controls and status indicators.
Note: Figure 2 shows the control-panel door in the closed (normal) position. To access the power-control button, you must open the control-panel door.
Figure 2. Blade server control panel buttons and LEDs
1 Media-tray select button: Press this button to associate the shared BladeCenter unit media tray (removable-media drives and front-panel USB ports) with the blade server. The LED on the button flashes while the request is being processed, then is lit when the ownership of the media tray has been transferred to the blade server. It can take approximately 20 seconds for the operating system in the blade server to recognize the media tray.
If there is no response when you press the media-tray select button, use the management module to determine whether local control has been disabled on the blade server.
Note: The operating system in the blade server must provide USB support for the blade server to recognize and use the removable-media drives and USB ports.
2 Information LED: When this amber LED is lit, it indicates that information about a system error for the blade server has been placed in the management-module event log. The information LED can be turned off through the Web interface of the management module or through IBM Director Console.
3 Blade-error LED: When this amber LED is lit, it indicates that a system error has occurred in the blade server. The blade-error LED will turn off after one of the following events:
v Correcting the error
v Reseating the blade server in the BladeCenter unit
v Cycling the BladeCenter unit power
4 Power-control button: This button is behind the control panel door. Press this button to turn on or turn off the blade server.
The power-control button has effect only if local power control is enabled for the blade server. Local power control is enabled and disabled through the Web interface of the management module.
Press the power button for 5 seconds to begin powering down the blade server.
5 NMI reset (recessed): The nonmaskable interrupt (NMI) reset dumps the partition. Use this recessed button only as directed by IBM Support.
6 Power-on LED: This green LED indicates the power status of the blade server in the following manner:
v Flashing rapidly: The service processor is initializing the blade server.
v Flashing slowly: The blade server has completed initialization and is waiting for a power-on command.
v Lit continuously: The blade server has power and is turned on.
Note: The enhanced service processor can take as long as three minutes to initialize after you install the BladeCenter PS700 blade server, at which point the LED begins to flash slowly.
7 Activity LED: When this green LED is lit, it indicates that there is activity on the hard disk drive or network.
8 Location LED: When this blue LED is lit, it has been turned on by the system administrator to aid in visually locating the blade server. The location LED can be turned off through the Web interface of the management module or through IBM Director Console.
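The three power-on LED states described above map directly to a lookup table, for example in a script that records observations during service. This is a minimal sketch: the enum and function names are hypothetical, and the meanings are paraphrased from the power-on LED description.

```python
from enum import Enum


class PowerLed(Enum):
    FLASHING_RAPIDLY = "flashing rapidly"
    FLASHING_SLOWLY = "flashing slowly"
    LIT_CONTINUOUSLY = "lit continuously"


# Meanings paraphrased from the power-on LED description in this section.
POWER_LED_MEANING = {
    PowerLed.FLASHING_RAPIDLY: "service processor is initializing the blade server",
    PowerLed.FLASHING_SLOWLY: "initialization complete; waiting for a power-on command",
    PowerLed.LIT_CONTINUOUSLY: "blade server has power and is turned on",
}


def describe_power_led(state: PowerLed) -> str:
    """Return a one-line description of an observed power-on LED state."""
    return f"Power-on LED {state.value}: {POWER_LED_MEANING[state]}"
```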

Turning on the blade server

After you connect the blade server to power through the BladeCenter unit, you can start the blade server after the discovery and initialization process is complete.
You can start the blade server in any of the following ways.
v Start the blade server by pressing the power-control button on the front of the blade server.
The power-control button is behind the control panel door, as described in “Blade server control panel buttons and LEDs” on page 5.
After you push the power-control button, the power-on LED continues to blink slowly for about 15 seconds, then is lit solidly when the power-on process is complete.
Wait until the power-on LED on the blade server flashes slowly before you press the blade server power-control button. If the power-on LED is flashing rapidly, the service processor is initializing the blade server. The power-control button does not respond during initialization.
Note: The enhanced service processor can take as long as three minutes to initialize after you install the BladeCenter PS700 blade server, at which point the LED begins to flash slowly.
v Start the blade server automatically when power is restored after a power failure.
If a power failure occurs, the BladeCenter unit and then the blade server can start automatically when power is restored. You must configure the blade server to restart through the management module.
v Start the blade server remotely using the management module.
After you initiate the power-on process, the power-on LED blinks slowly for about 15 seconds, then is lit solidly when the power-on process is complete.
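The power-on LED behavior described above can be summarized as a small lookup. The following is a minimal sketch for operator notes or training material, not IBM software; the names PowerOnLed, describe, and ready_to_power_on are hypothetical and only restate the three LED states mentioned in this section.

```python
from enum import Enum


class PowerOnLed(Enum):
    """Power-on LED states named in this section (hypothetical enum)."""
    FLASHING_RAPIDLY = "flashing rapidly"
    FLASHING_SLOWLY = "flashing slowly"
    LIT_SOLIDLY = "lit solidly"


# Meaning of each state, restated from the text above.
_MEANING = {
    PowerOnLed.FLASHING_RAPIDLY: (
        "The service processor is initializing the blade server; "
        "the power-control button does not respond."
    ),
    PowerOnLed.FLASHING_SLOWLY: (
        "Initialization is complete; the blade server can be turned on."
    ),
    PowerOnLed.LIT_SOLIDLY: "The power-on process is complete.",
}


def describe(led: PowerOnLed) -> str:
    """Return the meaning of a power-on LED state."""
    return _MEANING[led]


def ready_to_power_on(led: PowerOnLed) -> bool:
    """True only when the LED flashes slowly, the state to wait for
    before pressing the power-control button to start the blade server."""
    return led is PowerOnLed.FLASHING_SLOWLY
```

For example, ready_to_power_on(PowerOnLed.FLASHING_RAPIDLY) is False, which matches the instruction to wait until the LED flashes slowly.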

Turning off the blade server

When you turn off the blade server, it is still connected to power through the BladeCenter unit. The blade server can respond to requests from the service processor, such as a remote request to turn on the blade server. To remove all power from the blade server, you must remove it from the BladeCenter unit.
Shut down the operating system before you turn off the blade server. See the operating-system documentation for information about shutting down the operating system.
You can turn off the blade server in one of the following ways.
• Turn off the blade server by pressing the power-control button for at least 5 seconds.
  The power-control button is on the blade server behind the control panel door. See “Blade server control panel buttons and LEDs” on page 5 for the location.
  Note: The power-on LED can remain on solidly for up to 1 minute after you push the power-control button. After you turn off the blade server, wait until the power-on LED is blinking slowly before you press the power-control button to turn on the blade server again.
  If the operating system stops functioning, press and hold the power-control button for more than 5 seconds to force the blade server to turn off.
• Use the management module to turn off the blade server.
  The power-on LED can remain on solidly for up to 1 minute after you initiate the power-off process. After you turn off the blade server, wait until the power-on LED is blinking slowly before you initiate the power-on process from the advanced management module (AMM) to turn on the blade server again.
  Use the management-module Web interface to configure the management module to turn off the blade server if the system is not operating correctly.
  For additional information, see the online documentation or the User's Guide for the management module.

System-board layouts

Illustrations show the connectors and LEDs on the system board. The illustrations might differ slightly from your hardware.

System-board connectors

Blade server components attach to the connectors on the system board.
Figure 3 shows the connectors on the base unit system board in the blade server.
Figure 3. PS700 system-board connectors
Table 2 shows connector descriptions.
Table 2. PS700 connectors
Callout PS700 blade server connectors
1 Operator panel connector
2 DIMM 1-4 connectors (See Figure 4 on page 9 for individual connectors.) Expansion unit (SMP) connector
3 Management card connector (P1-C9)
4 SAS hard disk drive connector (P1-D2)
5 Light Path Blue Button
6 SAS hard disk drive (P1-C10)
7 CIOv (1Xe) expansion card connector (P1-C11)
8 High-Speed (CFFh) expansion card connector (P1-C12)
9 DIMM 5-8 connectors (See Figure 4 on page 9 for individual connectors.)
10 3V lithium battery connector (P1-E1)
Figure 4 on page 9 shows individual DIMM connectors.
Figure 4. DIMM connectors. Base unit connectors

System-board LEDs

Use the illustration of the LEDs on the system board to identify a light emitting diode (LED).
Remove the blade server from the BladeCenter unit, open the cover, press the blue button to see any error LEDs that were turned on during error processing, and use Figure 5 to identify the failing component.
Figure 5 shows the locations of LEDs on the system board. Table 3 shows LED descriptions.
Figure 5. LED locations on the system board of the PS700 blade server
Table 3. PS700 LEDs
Callout Base unit LEDs
1 3V lithium battery LED
2 DIMM 1-4 LEDs
3 Management card LED
4 Light path power LED
5 System board LED
6 HDD1 LED
7 Interposer LED
8 CIOv (1Xe) expansion card connector LED
9 High-Speed (CFFh) expansion card connector LED
10 HDD2 LED
11 DIMM 5-8 LEDs

Chapter 2. Diagnostics

Use the available diagnostic tools to help solve any problems that might occur in the blade server.
The first and most crucial component of a solid serviceability strategy is the ability to accurately and effectively detect errors when they occur. While not all errors are a threat to system availability, those that go undetected are dangerous because the system does not have the opportunity to evaluate and act if necessary. POWER7® processor-based systems are specifically designed with error-detection mechanisms that extend from processor cores and memory to power supplies and hard drives.
POWER7 processor-based systems contain specialized hardware detection circuitry for detecting erroneous hardware operations. Error checking hardware ranges from parity error detection coupled with processor instruction retry and bus retry, to ECC correction on caches and system buses.
IBM hardware error checkers have these distinct attributes:
• Continuous monitoring of system operations to detect potential calculation errors
• Attempted isolation of physical faults based on runtime detection of each unique failure
• Initiation of a wide variety of recovery mechanisms designed to correct a problem
POWER7 processor-based systems include extensive hardware and firmware recovery logic.
Machine check handling
Machine checks are handled by firmware. When a machine check occurs, the firmware analyzes the error to identify the failing device and creates an error log entry.
If the system degrades to the point that the service processor cannot reach standby state, the ability to analyze the error does not exist. If the error occurs during POWER® hypervisor (PHYP) activities, the PHYP initiates a system reboot.
In partitioned mode, an error that occurs during partition activity is reported to the operating system in the partition.

Diagnostic tools

Tools are available to help you diagnose and solve hardware-related problems.
• Power-on self-test (POST) progress codes (checkpoints), error codes, and isolation procedures
  The POST checks out the hardware at system initialization. IPL diagnostic functions test some system components and interconnections. The POST generates eight-digit checkpoints to mark the progress of powering up the blade server.
  Use the management module to view progress codes. The documentation of a progress code includes recovery actions for system hangs. See “POST progress codes (checkpoints)” on page 84 for more information. If the service processor detects a problem during POST, an error code is logged in the management module event log. Error codes are also logged in the Linux syslog or AIX® diagnostic log, if possible.
  See “System reference codes (SRCs)” on page 16. The service processor can generate codes that point to specific isolation procedures. See “Service processor problems” on page 200.
• Light path diagnostics
  Use the light path diagnostic LEDs on the system board to identify failing hardware. If the system error LED on the system LED panel on the front or rear of the BladeCenter unit is lit, one or more error LEDs on the BladeCenter unit components also might be lit.
  Light path diagnostics help identify failing customer replaceable units (CRUs). CRU location codes are included in error codes and the event log.
  LED locations: See “System-board LEDs” on page 9.
  Front panel: See “Blade server control panel buttons and LEDs” on page 5.
• Troubleshooting tables
  Use the troubleshooting tables to find solutions to problems that have identifiable symptoms. See “Troubleshooting tables” on page 191.
• Dump data collection
  In some circumstances, an error might require a dump to show more data. The Integrated Virtualization Manager (IVM) or Hardware Management Console (HMC) sets up a dump area. Specific IVM or HMC information is included as part of the information that can optionally be sent to IBM support for analysis.
  See “Collecting dump data” on page 13 for more information.
• Stand-alone diagnostics
  The AIX-based stand-alone diagnostics CD is in the ship package and is also available from the IBM Web site. Boot the diagnostics from a CD drive or from an AIX network installation manager (NIM) server if the blade server cannot boot to an operating system, no matter which operating system is installed.
  Functions provided by the stand-alone diagnostics include:
  – Analysis of errors reported by platform, such as microprocessor and memory errors
  – Testing of resources, such as I/O adapters and devices
  – Service aids, such as firmware update, format disk, and Raid Manager
• Diagnostic utilities for the AIX operating system
  If AIX is functioning, run AIX concurrent diagnostics instead of the stand-alone diagnostics. Functions provided by disk-based AIX diagnostics include:
  – Automatic error log analysis
  – Analysis of errors reported by platform, such as microprocessor and memory errors
  – Testing of resources, such as I/O adapters and devices
  – Service aids, such as firmware update, format disk, and Raid Manager
• Diagnostic utilities for Linux operating systems
  Linux on POWER service and productivity tools include hardware diagnostic aids and productivity tools, and installation aids. The installation aids are provided in the IBM Installation Toolkit for Linux on POWER, a set of tools that aids the installation of Linux on IBM servers with POWER architecture. You can also use the tools to update the PS700 blade server firmware.
  Diagnostic utilities for the Linux operating system are available from IBM at https://www14.software.ibm.com/webapp/set2/sas/f/lopdiags/home.html.
• Diagnostic utilities for other operating systems
  You can use the stand-alone diagnostics CD to perform diagnostics on the PS700 blade server, no matter which operating system is loaded on the blade server. However, other supported operating systems might have diagnostic tools that are available through the operating system. See the documentation for your operating system for more information.

Collecting dump data

A dump might be critical for fault isolation when the built-in First Failure Data Capture (FFDC) mechanisms are not capturing sufficient fault data. Even when a fault is identified, dump data can provide additional information that is useful in problem determination.
All hardware state information is part of the dump if a hardware checkstop occurs. When a checkstop occurs, the service processor attempts to dump data that is necessary to analyze the error from appropriate parts of the system.
Note: If you power off the blade through the management module while the service processor is performing a dump, platform dump data is lost.
You might be asked to retrieve a dump to send it to IBM Support for analysis. The location of the dump data varies by operating system.
• Collect an AIX dump from the /var/adm/platform directory.
• Collect a Linux dump from the /var/log/dump directory.
• Collect an Integrated Virtualization Manager (IVM) dump from the IVM-managed PS700 blade server through the Manage Dumps task in the IVM console.
• To collect a system dump by using the Hardware Management Console (HMC), complete these steps:
  1. Perform a controlled shutdown of all partitions.
     Note: A system dump will abnormally terminate any running partitions.
  2. In the navigation area, open Systems Management.
  3. Select the server and open it.
  4. Select Serviceability > Manage Dumps > Action > Initiate System Dump. The dump is automatically saved to the HMC. For details on how to copy, report, or delete a dump after you have completed a dump, see Managing dumps.
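The operating-system dump locations listed above can be checked with a short script before you contact IBM Support. The following is a minimal sketch, assuming a Python interpreter is available on the partition; the function name list_dump_files and the DUMP_DIRECTORIES mapping are hypothetical, and only the two directory paths come from this guide.

```python
from pathlib import Path

# Dump directories named in this guide, keyed by operating system.
DUMP_DIRECTORIES = {
    "aix": Path("/var/adm/platform"),
    "linux": Path("/var/log/dump"),
}


def list_dump_files(os_name, dump_dirs=DUMP_DIRECTORIES):
    """Return the regular files in the dump directory for os_name, sorted.

    Returns an empty list when the directory does not exist.
    The dump_dirs parameter exists only so the mapping can be overridden.
    """
    key = os_name.strip().lower()
    if key not in dump_dirs:
        raise ValueError(f"no dump directory known for {os_name!r}")
    directory = Path(dump_dirs[key])
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    for path in list_dump_files("linux"):
        print(path)
```

The script only lists files; IVM and HMC dumps are still collected through the Manage Dumps tasks described above.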

Location codes

Location codes identify components of the blade server. Location codes are displayed with some error codes to identify the blade server component that is causing the error.
See “System-board connectors” on page 8 for component locations.
Notes:
1. Location codes do not indicate the location of the blade server within the BladeCenter unit. The codes identify components of the blade server only.
2. For checkpoints with no associated location code, see “Light path diagnostics” on page 214 to identify the failing component when there is a hang condition.
3. For checkpoints with location codes, use the following table to identify the failing component when there is a hang condition.
4. For 8-digit codes not listed in Table 4, see the “Checkout procedure” on page 184.
Table 4. Location codes
Components Physical Location Code CRU LED
Un location codes are for enclosure and VPD locations.
Un = Utttt.mmm.sssssss
tttt = system machine type
mmm = system model number
sssssss = system serial number
DIMM 1 Un-P1-C1 Yes
DIMM 2 Un-P1-C2 Yes
DIMM 3 Un-P1-C3 Yes
DIMM 4 Un-P1-C4 Yes
DIMM 5 Un-P1-C5 Yes
DIMM 6 Un-P1-C6 Yes
DIMM 7 Un-P1-C7 Yes
DIMM 8 Un-P1-C8 Yes
2.5" SAS HDD1 Un-P1-D1 Yes
2.5" SAS HDD2 Un-P1-D2 Yes
Management Card Un-P1-C9 Yes
Battery Un-P1-E1 Yes
PCIe High Speed Expansion Card Un-P1-C12 Yes
1Xe Card Un-P1-C11 Yes
USB Port 1 (CDROM/FDD) Un-P1-T1 No
USB Port 2 (CDROM/FDD) Un-P1-T2 No
SAS controller Un-P1-T3 No
Ethernet HEA0_A Un-P1-T4 No
Ethernet HEA0_B Un-P1-T5 No
Machine Location Code Utttt.mmm.sssssss No
Um codes are for firmware. The format is the same as for a Un location code.
Um = Utttt.mmm.sssssss
Firmware version Um-Y1
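The Utttt.mmm.sssssss format and the component suffixes in Table 4 can be decoded mechanically. The following is a minimal sketch; parse_location_code and COMPONENTS are hypothetical names, the assumption that each field is alphanumeric is mine, and the serial number in the usage note is made up for illustration.

```python
import re

# Utttt.mmm.sssssss, optionally followed by "-<path>" such as "-P1-C1".
# Assumption: each field contains only letters and digits.
_LOCATION = re.compile(
    r"^U(?P<machine_type>[0-9A-Z]{4})\.(?P<model>[0-9A-Z]{3})"
    r"\.(?P<serial>[0-9A-Z]{7})(?:-(?P<path>[0-9A-Z-]+))?$"
)

# Component suffixes copied from Table 4.
COMPONENTS = {
    "P1-C1": "DIMM 1", "P1-C2": "DIMM 2", "P1-C3": "DIMM 3",
    "P1-C4": "DIMM 4", "P1-C5": "DIMM 5", "P1-C6": "DIMM 6",
    "P1-C7": "DIMM 7", "P1-C8": "DIMM 8",
    "P1-D1": '2.5" SAS HDD1', "P1-D2": '2.5" SAS HDD2',
    "P1-C9": "Management Card", "P1-E1": "Battery",
    "P1-C12": "PCIe High Speed Expansion Card", "P1-C11": "1Xe Card",
    "P1-T1": "USB Port 1 (CDROM/FDD)", "P1-T2": "USB Port 2 (CDROM/FDD)",
    "P1-T3": "SAS controller", "P1-T4": "Ethernet HEA0_A",
    "P1-T5": "Ethernet HEA0_B", "Y1": "Firmware version",
}


def parse_location_code(code):
    """Split a location code into its fields and name the component."""
    match = _LOCATION.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a Utttt.mmm.sssssss location code: {code!r}")
    fields = match.groupdict()
    path = fields["path"]
    if path is None:
        fields["component"] = "Machine Location Code"
    else:
        fields["component"] = COMPONENTS.get(path, "unknown")
    return fields
```

For example, parse_location_code("U8406.70Y.1234567-P1-C9") reports machine type 8406, model 70Y, and the Management Card. A suffix that is not in Table 4 is reported as "unknown" rather than guessed.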

Reference codes

Reference codes are diagnostic aids that help you determine the source of a hardware or operating system problem. To use reference codes effectively, use them in conjunction with other service and support procedures.
The BladeCenter PS700 Type 8406 blade server produces several types of codes.
Progress codes: The power-on self-test (POST) generates eight-digit status codes that are known as checkpoints or progress codes, which are recorded in the management-module event log. The checkpoints indicate which blade server resource is initializing.
Error codes: The First Failure Data Capture (FFDC) error checkers capture fault data, which the service processor then analyzes. For unrecoverable errors (UEs), for recoverable events that meet or exceed their service thresholds, and for fatal system errors, an unrecoverable checkstop service event triggers the service processor to analyze the error, log the system reference code (SRC), and turn on the system attention LED.
The service processor logs the nine-word, eight-digit per word error code in the BladeCenter management-module event log. Error codes are either system reference codes (SRCs) or service request numbers (SRNs). A location code might also be included.
Isolation procedures: If the fault analysis does not determine a definitive cause, the service processor might indicate a fault isolation procedure that you can use to isolate the failing component.
Viewing the codes
The PS700 blade server does not display checkpoints or error codes on the remote console. The shared BladeCenter unit video also does not display the codes.
If the POST detects a problem, a 9-word, 8-digit error code is logged in the BladeCenter management-module event log. A location code that identifies a component might also be included. See “Error logs” on page 183 for information about viewing the management-module event log.
Service request numbers can be viewed using the AIX diagnostics CD, or various operating system utilities, such as AIX diagnostics or the Linux service aid “diagela”, if it is installed.

System reference codes (SRCs)

System reference codes indicate a server hardware or software problem that can originate in hardware, in firmware, or in the operating system.
A blade server component generates an error code when it detects a problem. An SRC identifies the component that generated the error code and describes the error. Use the SRC information to identify a list of possibly failing items and to find information about any additional isolation procedures.
The following table shows the syntax of a nine-word B700xxxx SRC as it might be displayed in the event log of the management module.
The first word of the SRC in this example is the message identifier, B7001111. This example numbers each word after the first word to show relative word positions. The seventh word is the direct select address, which is 77777777 in the example.
Table 5. Nine-word system reference code in the management-module event log
Index Sev Source Date/Time Text
1 E Blade_05 01/21/2008, 17:15:14 (PS700-BC1BLD5E) SYS F/W: Error. Replace UNKNOWN (5008FECF B7001111 22222222 33333333 44444444 55555555 66666666 77777777 88888888 99999999)
Depending on your operating system and the utilities you have installed, error messages might also be stored in an operating system log. See the documentation that comes with the operating system for more information.
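The word positions described for Table 5 can be extracted from the event text with a few lines of code. The following is a minimal sketch, assuming the nine SRC words are the last nine eight-digit hexadecimal words inside the final parenthesized group, as in the Table 5 example; extract_src_words is a hypothetical helper, not a management-module function.

```python
import re

# One SRC word: exactly eight hexadecimal digits.
_HEX_WORD = re.compile(r"\b[0-9A-Fa-f]{8}\b")


def extract_src_words(event_text):
    """Return the nine SRC words from a management-module event text.

    Assumption: the SRC is the last nine eight-digit hex words in the
    final parenthesized group; any earlier word in that group is skipped.
    """
    start = event_text.rfind("(")
    end = event_text.rfind(")")
    if start == -1 or end < start:
        raise ValueError("no parenthesized word group found")
    words = _HEX_WORD.findall(event_text[start + 1:end])
    if len(words) < 9:
        raise ValueError("fewer than nine eight-digit words found")
    return [w.upper() for w in words[-9:]]


if __name__ == "__main__":
    text = (
        "(PS700-BC1BLD5E) SYS F/W: Error. Replace UNKNOWN "
        "(5008FECF B7001111 22222222 33333333 44444444 55555555 "
        "66666666 77777777 88888888 99999999)"
    )
    words = extract_src_words(text)
    print("message identifier:", words[0])
    print("word 7 (direct select address):", words[6])
```

With the Table 5 text, words[0] is the message identifier and words[6] is the seventh word, which the text above identifies as the direct select address.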
The management module can display the most recent 32 SRCs and time stamps. Manually refresh the list to update it.
Select Blade Service Data > blade_name in the management module to see a list of the 32 most recent SRCs.
Table 6. Management module reference code listing
Unique ID System Reference Code Timestamp
00040001 D1513901 2005-11-13 19:30:20
00000016 D1513801 2005-11-13 19:30:16
Any message with more detail is highlighted as a link in the System Reference Code column. Click the message to cause the management module to present the additional message detail:
D1513901
Created at: 2007-11-13 19:30:20
SRC Version: 0x02
Hex Words 2-5: 020110F0 52298910 C1472000 200000FF
SRC formats
SRCs are strings of either six or eight alphanumeric characters. The first two characters designate the reference code type.
The first character indicates the type of error. In a few cases, the first two characters indicate the type of error:
• 1xxxxxxx - System power control network (SPCN) error
• 6xxxxxxx - Virtual optical device error
• A1xxxxxx - Attention required (Service processor)
• AAxxxxxx - Attention required (Partition firmware)
• B1xxxxxx - Service processor error, such as a boot problem
• B6xxxxxx - Licensed Internal Code or hardware event error
• B9xxxxxx - Software installation error or IBM i IPL error. See "Recovering from IPL or system failures" in the IBM i Information Center at http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/ipha5_p5/iplprocedure.htm.
• BAxxxxxx - Partition firmware error
• Cxxxxxxx - Checkpoint (must hang to indicate an error)
• Dxxxxxxx - Dump checkpoint (must hang to indicate an error)
To find a description of an SRC that is not listed in this PS700 blade server documentation, refer to the POWER7 Reference Code Lookup page at http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/ipha8/codefinder.htm.
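The prefix rules in the SRC formats list can be expressed as a lookup that tries two-character prefixes before one-character prefixes. The following is a minimal sketch; classify_src and SRC_PREFIXES are hypothetical names, and a prefix that is not in the list returns None instead of a guess.

```python
# Prefixes and meanings copied from the SRC formats list.
SRC_PREFIXES = {
    "1": "System power control network (SPCN) error",
    "6": "Virtual optical device error",
    "A1": "Attention required (Service processor)",
    "AA": "Attention required (Partition firmware)",
    "B1": "Service processor error, such as a boot problem",
    "B6": "Licensed Internal Code or hardware event error",
    "B9": "Software installation error or IBM i IPL error",
    "BA": "Partition firmware error",
    "C": "Checkpoint (must hang to indicate an error)",
    "D": "Dump checkpoint (must hang to indicate an error)",
}


def classify_src(src):
    """Return the SRC type for a six- or eight-character code, or None."""
    code = src.strip().upper()
    if len(code) not in (6, 8) or not code.isalnum():
        raise ValueError(f"not a six- or eight-character SRC: {src!r}")
    # Try the more specific two-character prefix first.
    for length in (2, 1):
        meaning = SRC_PREFIXES.get(code[:length])
        if meaning is not None:
            return meaning
    return None
```

For example, classify_src("BA000010") returns the partition firmware entry, while a code whose prefix is not in the list returns None and should be looked up on the Reference Code Lookup page.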
1xxxyyyy SRCs
The 1xxxyyyy system reference codes are system power control network (SPCN) reference codes.
Look for the rightmost 4 characters (yyyy in 1xxxyyyy) in the error code; this is the reference code. Find the reference code in Table 7.
Perform all actions before exchanging failing items.
Table 7. 1xxxyyyy SRCs
• Follow the suggested actions in the order in which they are listed in the Action column until the problem is solved. If an action solves the problem, then you can stop performing the remaining actions.
• See Chapter 3, “Parts listing, Type 8406,” on page 229 to determine which components are CRUs and which components are FRUs.
1xxxyyyy Error Codes | Description | Action
00AC | Informational message: AC loss was reported | No action is required.
00AD | Informational message: A service processor reset caused the blade server to power off | No action is required.
1F02 | Informational message: The trace logs reached 1K of data. | No action is required.
1F03 | Informational message: Invalid TMS of location code. | No action is required.
2600 | Power good (pGood) master fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2610 | pGood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2620 | 12V dc pGood input fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2629 | 1.5V reg_pgood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
262B | 1.8V reg_pgood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
262C | 5V reg_pgood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
262D | 3.3V reg_pgood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
262E | 2.5V reg_pgood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2630 | VRM CP0 core pGood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2632 | VRM CP0 cache pGood fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2647 | 12V "or-ing" FET short | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
2648 | Blade power latch fault | 1. Go to “Checkout procedure” on page 184. 2. Replace the system-board, as described in “Replacing the FRU system-board and chassis assembly” on page 260.
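The lookup described for 1xxxyyyy SRCs (take the rightmost four characters and find them in Table 7) can be sketched as follows. This covers only the part of Table 7 shown above; lookup_spcn and the two action lists are hypothetical names, and the full SRC in the usage note is a made-up example.

```python
NO_ACTION = ["No action is required."]
CHECKOUT_THEN_REPLACE = [
    "Go to “Checkout procedure” on page 184.",
    "Replace the system-board, as described in “Replacing the FRU "
    "system-board and chassis assembly” on page 260.",
]

# Reference code (yyyy) -> (description, ordered actions), from Table 7.
SPCN_CODES = {
    "00AC": ("Informational message: AC loss was reported", NO_ACTION),
    "00AD": ("Informational message: A service processor reset caused "
             "the blade server to power off", NO_ACTION),
    "1F02": ("Informational message: The trace logs reached 1K of data.",
             NO_ACTION),
    "1F03": ("Informational message: Invalid TMS of location code.",
             NO_ACTION),
    "2600": ("Power good (pGood) master fault", CHECKOUT_THEN_REPLACE),
    "2610": ("pGood fault", CHECKOUT_THEN_REPLACE),
    "2620": ("12V dc pGood input fault", CHECKOUT_THEN_REPLACE),
    "2629": ("1.5V reg_pgood fault", CHECKOUT_THEN_REPLACE),
    "262B": ("1.8V reg_pgood fault", CHECKOUT_THEN_REPLACE),
    "262C": ("5V reg_pgood fault", CHECKOUT_THEN_REPLACE),
    "262D": ("3.3V reg_pgood fault", CHECKOUT_THEN_REPLACE),
    "262E": ("2.5V reg_pgood fault", CHECKOUT_THEN_REPLACE),
    "2630": ("VRM CP0 core pGood fault", CHECKOUT_THEN_REPLACE),
    "2632": ("VRM CP0 cache pGood fault", CHECKOUT_THEN_REPLACE),
    "2647": ('12V "or-ing" FET short', CHECKOUT_THEN_REPLACE),
    "2648": ("Blade power latch fault", CHECKOUT_THEN_REPLACE),
}


def lookup_spcn(src):
    """Return (description, actions) for a 1xxxyyyy SRC, or None when
    the reference code is not in the portion of Table 7 shown here."""
    code = src.strip().upper()
    if len(code) != 8 or not code.startswith("1"):
        raise ValueError(f"not a 1xxxyyyy SRC: {src!r}")
    return SPCN_CODES.get(code[-4:])
```

For example, lookup_spcn("11002620") returns the 12V dc pGood input fault entry with its two actions in order. Perform the actions in the listed order, and stop when one solves the problem.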