IBM Power System, Power System 9006-22P, Power System 5104-22C, Power System 9006-12P, Power System 9006-22C Problem Analysis, System Parts, And Locations
Problem analysis, system parts, and
locations for the 5104-22C, 9006-12P,
9006-22C, and 9006-22P
IBM
Note
Before using this information and the product it supports, read the information in “Safety notices” on
page v, “Notices” on page 109, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823.
This edition applies to IBM® Power Systems servers that contain the POWER9™ processor and to all associated models.
Safety notices
Safety notices may be printed throughout this guide:
• DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
• CAUTION notices call attention to a situation that is potentially hazardous to people because of some
existing condition.
• Attention notices call attention to the possibility of damage to a program, device, system, or data.
World Trade safety information
Several countries require the safety information contained in product publications to be presented in their
national languages. If this requirement applies to your country, safety information documentation is
included in the publications package (such as in printed documentation, on DVD, or as part of the product)
shipped with the product. The documentation contains the safety information in your national language
with references to the U.S. English source. Before using a U.S. English publication to install, operate, or
service this product, you must first become familiar with the related safety information documentation.
You should also refer to the safety information documentation any time you do not clearly understand any
safety information in the U.S. English publications.
Replacement or additional copies of safety information documentation can be obtained by calling the IBM
Hotline at 1-800-300-8751.
German safety information
Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der
Bildschirmarbeitsverordnung geeignet.
Laser safety information
IBM servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.
Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.
DANGER: When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous.
To avoid a shock hazard:
• If IBM supplied the power cord(s), connect power to this unit only with the IBM provided power
cord. Do not use the IBM provided power cord for any other product.
• Do not open or service any power supply assembly.
• Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration
of this product during an electrical storm.
• The product might be equipped with multiple power cords. To remove all hazardous voltages,
disconnect all power cords.
– For AC power, disconnect all power cords from their AC power source.
– For racks with a DC power distribution panel (PDP), disconnect the customer’s DC power
source to the PDP.
• When connecting power to the product ensure all power cables are properly connected.
– For racks with AC power, connect all power cords to a properly wired and grounded electrical
outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the
system rating plate.
– For racks with a DC power distribution panel (PDP), connect the customer’s DC power source
to the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power
return wiring.
• Connect any equipment that will be attached to this product to properly wired outlets.
• When possible, use one hand only to connect or disconnect signal cables.
• Never turn on any equipment when there is evidence of fire, water, or structural damage.
• Do not attempt to switch on power to the machine until all possible unsafe conditions are
corrected.
• Assume that an electrical safety hazard is present. Perform all continuity, grounding, and power
checks specified during the subsystem installation procedures to ensure that the machine meets
safety requirements.
• Do not continue with the inspection if any unsafe conditions are present.
• Before you open the device covers, unless instructed otherwise in the installation and
configuration procedures: Disconnect the attached AC power cords, turn off the applicable
circuit breakers located in the rack power distribution panel (PDP), and disconnect any
telecommunications systems, networks, and modems.
DANGER:
• Connect and disconnect cables as described in the following procedures when installing,
moving, or opening covers on this product or attached devices.
To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. For AC power, remove the power cords from the outlets.
3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the
PDP and remove the power from the Customer's DC power source.
4. Remove the signal cables from the connectors.
5. Remove all cables from the devices.
To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. For AC power, attach the power cords to the outlets.
5. For racks with a DC power distribution panel (PDP), restore the power from the Customer's
DC power source and turn on the circuit breakers located in the PDP.
6. Turn on the devices.
Sharp edges, corners and joints may be present in and around the system. Use care when
handling equipment to avoid cuts, scrapes and pinching. (D005)
(R001 part 1 of 2):
DANGER: Observe the following precautions when working on or around your IT rack system:
• Heavy equipment–personal injury or equipment damage might result if mishandled.
• Always lower the leveling pads on the rack cabinet.
• Always install stabilizer brackets on the rack cabinet unless the earthquake option is to be
installed.
• To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest
devices in the bottom of the rack cabinet. Always install servers and optional devices starting
from the bottom of the rack cabinet.
• Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top
of rack-mounted devices. In addition, do not lean on rack mounted devices and do not use them
to stabilize your body position (for example, when working from a ladder).
• Stability hazard:
– The rack may tip over causing serious personal injury.
– Before extending the rack to the installation position, read the installation instructions.
– Do not put any load on the slide-rail mounted equipment mounted in the installation position.
– Do not leave the slide-rail mounted equipment in the installation position.
• Each rack cabinet might have more than one power cord.
– For AC powered racks, be sure to disconnect all power cords in the rack cabinet when directed
to disconnect power during servicing.
– For racks with a DC power distribution panel (PDP), turn off the circuit breaker that controls
the power to the system unit(s), or disconnect the customer’s DC power source, when
directed to disconnect power during servicing.
• Connect all devices installed in a rack cabinet to power devices installed in the same rack
cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device
installed in a different rack cabinet.
• An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts
of the system or the devices that attach to the system. It is the responsibility of the customer to
ensure that the outlet is correctly wired and grounded to prevent an electrical shock. (R001 part
1 of 2)
(R001 part 2 of 2):
CAUTION:
• Do not install a unit in a rack where the internal rack ambient temperatures will exceed the
manufacturer's recommended ambient temperature for all your rack-mounted devices.
• Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not
blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
• Consideration should be given to the connection of the equipment to the supply circuit so that
overloading of the circuits does not compromise the supply wiring or overcurrent protection. To
provide the correct power connection to a rack, refer to the rating labels located on the
equipment in the rack to determine the total power requirement of the supply circuit.
• (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer
brackets are not attached to the rack or if the rack is not bolted to the floor. Do not pull out more
than one drawer at a time. The rack might become unstable if you pull out more than one drawer
at a time.
• (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless
specified by the manufacturer. Attempting to move the drawer partially or completely out of the
rack might cause the rack to become unstable or cause the drawer to fall out of the rack. (R001
part 2 of 2)
CAUTION: Removing components from the upper positions in the rack cabinet improves rack
stability during relocation. Follow these general guidelines whenever you relocate a populated
rack cabinet within a room or building.
• Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack
cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you
received it. If this configuration is not known, you must observe the following precautions:
– Remove all devices in the 32U position (compliance ID RACK-001) or 22U (compliance ID
RR001) and above.
– Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
– Ensure that there are little-to-no empty U-levels between devices installed in the rack cabinet
below the 32U (compliance ID RACK-001) or 22U (compliance ID RR001) level, unless the
received configuration specifically allowed it.
• If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet
from the suite.
• If the rack cabinet you are relocating was supplied with removable outriggers they must be
reinstalled before the cabinet is relocated.
• Inspect the route that you plan to take to eliminate potential hazards.
• Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to
the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
• Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
• Ensure that all devices, shelves, drawers, doors, and cables are secure.
• Ensure that the four leveling pads are raised to their highest position.
• Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
• Do not use a ramp inclined at more than 10 degrees.
• When the rack cabinet is in the new location, complete the following steps:
– Lower the four leveling pads.
– Install stabilizer brackets on the rack cabinet or in an earthquake environment bolt the rack to
the floor.
– If you removed any devices from the rack cabinet, repopulate the rack cabinet from the
lowest position to the highest position.
• If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack
cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent.
Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the
pallet.
(R002)
DANGER: Hazardous voltage, current, or energy levels are present inside any component that has
this label attached. Do not open any cover or barrier that contains this label. (L001)
DANGER: Rack-mounted devices are not to be used as shelves or work spaces. Do not place
objects on top of rack-mounted devices. In addition, do not lean on rack-mounted devices and do
not use them to stabilize your body position (for example, when working from a ladder). Stability
hazard:
• The rack may tip over causing serious personal injury.
• Before extending the rack to the installation position, read the installation instructions.
• Do not put any load on the slide-rail mounted equipment mounted in the installation position.
• Do not leave the slide-rail mounted equipment in the installation position.
(L002)
DANGER: Multiple power cords. The product might be equipped with multiple AC power cords or
multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and
power cables. (L003)
(L007)
CAUTION: A hot surface nearby. (L007)
(L008)
CAUTION: Hazardous moving parts nearby. (L008)
All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1
laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser
product. Consult the label on each part for laser certification numbers and approval information.
CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following
information:
• Do not remove the covers. Removing the covers of the laser product could result in exposure to
hazardous laser radiation. There are no serviceable parts inside the device.
• Use of the controls or adjustments or performance of procedures other than those specified
herein might result in hazardous radiation exposure.
(C026)
CAUTION: Data processing environments can contain equipment transmitting on system links
with laser modules that operate at greater than Class 1 power levels. For this reason, never look
into the end of an optical fiber cable or open receptacle. Although shining light into one end and
looking into the other end of a disconnected optical fiber to verify the continuity of optic fibers may
not injure the eye, this procedure is potentially dangerous. Therefore, verifying the continuity of
optical fibers by shining light into one end and looking at the other end is not recommended. To
verify continuity of a fiber optic cable, use an optical light source and power meter. (C027)
CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments.
(C028)
CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the
following information:
• Laser radiation when open.
• Do not stare into the beam, do not view directly with optical instruments, and avoid direct
exposure to the beam. (C030)
CAUTION: The battery contains lithium. To avoid possible explosion, do not burn or charge the
battery.
Do Not:
• Throw or immerse into water
• Heat to more than 100 degrees C (212 degrees F)
• Repair or disassemble
Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local
regulations. In the United States, IBM has a process for the collection of this battery. For
information, call 1-800-426-4333. Have the IBM part number for the battery unit available when
you call. (C003)
CAUTION: Regarding IBM provided VENDOR LIFT TOOL:
• Operation of LIFT TOOL by authorized personnel only.
• LIFT TOOL intended for use to assist, lift, install, remove units (load) up into rack elevations. It is
not to be used loaded transporting over major ramps nor as a replacement for such designated
tools like pallet jacks, walkies, fork trucks and such related relocation practices. When this is not
practicable, specially trained persons or services must be used (for instance, riggers or movers).
• Read and completely understand the contents of LIFT TOOL operator's manual before using.
Failure to read, understand, obey safety rules, and follow instructions may result in property
damage and/or personal injury. If there are questions, contact the vendor's service and support.
Local paper manual must remain with machine in provided storage sleeve area. Latest revision
manual available on vendor's web site.
• Test verify stabilizer brake function before each use. Do not over-force moving or rolling the LIFT
TOOL with stabilizer brake engaged.
• Do not raise, lower or slide platform load shelf unless stabilizer (brake pedal jack) is fully
engaged. Keep stabilizer brake engaged when not in use or motion.
• Do not move LIFT TOOL while platform is raised, except for minor positioning.
• Do not exceed rated load capacity. See LOAD CAPACITY CHART regarding maximum loads at
center versus edge of extended platform.
• Only raise load if properly centered on platform. Do not place more than 200 lb (91 kg) on edge
of sliding platform shelf also considering the load's center of mass/gravity (CoG).
• Do not corner load the platforms, tilt riser, angled unit install wedge or other such accessory
options. Secure such platforms -- riser tilt, wedge, etc options to main lift shelf or forks in all four
(4x or all other provisioned mounting) locations with provided hardware only, prior to use. Load
objects are designed to slide on/off smooth platforms without appreciable force, so take care
not to push or lean. Keep riser tilt [adjustable angling platform] option flat at all times except for
final minor angle adjustment when needed.
• Do not stand under overhanging load.
• Do not use on uneven surface, incline or decline (major ramps).
• Do not stack loads.
• Do not operate while under the influence of drugs or alcohol.
• Do not support ladder against LIFT TOOL (unless the specific allowance is provided for one
following qualified procedures for working at elevations with this TOOL).
• Tipping hazard. Do not push or lean against load with raised platform.
• Do not use as a personnel lifting platform or step. No riders.
• Do not stand on any part of lift. Not a step.
• Do not climb on mast.
• Do not operate a damaged or malfunctioning LIFT TOOL machine.
• Crush and pinch point hazard below platform. Only lower load in areas clear of personnel and
obstructions. Keep hands and feet clear during operation.
• No Forks. Never lift or move bare LIFT TOOL MACHINE with pallet truck, jack or fork lift.
• Mast extends higher than platform. Be aware of ceiling height, cable trays, sprinklers, lights, and
other overhead objects.
• Do not leave LIFT TOOL machine unattended with an elevated load.
• Watch and keep hands, fingers, and clothing clear when equipment is in motion.
• Turn Winch with hand power only. If winch handle cannot be cranked easily with one hand, it is
probably over-loaded. Do not continue to turn winch past top or bottom of platform travel.
Excessive unwinding will detach handle and damage cable. Always hold handle when lowering,
unwinding. Always assure self that winch is holding load before releasing winch handle.
• A winch accident could cause serious injury. Not for moving humans. Make certain clicking sound
is heard as the equipment is being raised. Be sure winch is locked in position before releasing
handle. Read instruction page before operating this winch. Never allow winch to unwind freely.
Freewheeling will cause uneven cable wrapping around winch drum, damage cable, and may
cause serious injury.
• This TOOL must be maintained correctly for IBM Service personnel to use it. IBM shall inspect
condition and verify maintenance history before operation. Personnel reserve the right not to use
TOOL if inadequate. (C048)
Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS
(Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
• Network telecommunications facilities
• Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring
or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the
interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as
intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation
from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect
these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.
The dc-powered system is intended to be installed in a common bonding network (CBN) as described in
GR-1089-CORE.
Beginning troubleshooting and problem analysis
This information provides a starting point for analyzing problems.
This information is the starting point for diagnosing and repairing systems. From this point, you are guided
to the appropriate information to help you diagnose problems, determine the appropriate repair action,
and then complete the necessary steps to repair the system.
Note: Update the system firmware to the latest level before you start problem analysis. If you update the
system firmware, you will have the latest available fixes and improvements for error handling, reporting,
and isolation. For instructions about updating the system firmware, see Getting fixes.
What type of problem are you dealing with? The problem analysis procedure follows each problem type:
• You do not know the type of problem. Go to “Determining the problem analysis procedure to perform” on page 1.
• A baseboard management controller (BMC) access problem occurred. Go to “Resolving a BMC access problem” on page 2.
• The system does not power on (the power button or the BMC power on command does not power on the system). Go to “Resolving a power problem” on page 5.
• A system firmware boot failure occurred (the system started but was not able to boot to the Petitboot menu). Go to “Resolving a system firmware boot failure” on page 5.
• A video graphics array (VGA) monitor problem occurred (the system started but no video is displayed on the monitor). Go to “Resolving a VGA monitor problem” on page 7.
• An operating system boot failure occurred (the system booted to the Petitboot menu but the operating system did not start). Go to “Resolving an operating system boot failure” on page 7.
• A sensor on the sensor readings GUI display is red. Go to “Resolving a sensor indicator problem” on page 9.
• A processor, memory, power, or cooling hardware failure occurred. Go to “Resolving a hardware problem” on page 10.
• Missing or faulty PCIe adapter or device. Go to Resolving a PCIe adapter or device problem.
• You have an FQPSPxxxxxxx event code. Go to FQPSPxxxxxxx Event Codes.
Determining the problem analysis procedure to perform
Learn how to identify the correct problem analysis procedure to perform.
About this task
To determine the correct problem analysis procedure to perform, complete the following steps:
Procedure
1. After you apply power to the system, are the power supply LEDs green (either steady or flashing)?
2. Can you access the baseboard management controller (BMC) across the network?
   Yes: Continue with the next step.
   No: Go to “Resolving a BMC access problem” on page 2.
3. Can you boot the system to the Petitboot menu?
   Yes: Continue with the next step.
   No: Go to “Resolving a system firmware boot failure” on page 5.
4. Is video displayed on the video graphics array (VGA) monitor?
   Yes: Continue with the next step.
   No: Go to “Resolving a VGA monitor problem” on page 7.
5. Can you start the operating system?
   Yes: Continue with the next step.
   No: Go to “Resolving an operating system boot failure” on page 7.
6. On the sensor readings GUI display, are any sensors red?
   Yes: Go to “Resolving a sensor indicator problem” on page 9.
   No: Continue with the next step.
7. Go to “Resolving a hardware problem” on page 10. This ends the procedure.
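The decision sequence in steps 2 through 6 can be sketched as a small shell function. This is only an illustration: the function name next_procedure and its argument order are hypothetical and not part of any IBM tool, and step 1 is left out because this procedure does not state a branch for it.

```shell
# Sketch only: maps yes/no answers for steps 2-6 to the procedure named in the text.
# next_procedure is a hypothetical helper, not an IBM-provided command.
# Arguments, in order (each "yes" or "no"):
#   $1 BMC reachable across the network?   (step 2)
#   $2 System boots to the Petitboot menu? (step 3)
#   $3 Video displayed on the VGA monitor? (step 4)
#   $4 Operating system starts?            (step 5)
#   $5 Any sensor red on the sensor GUI?   (step 6)
next_procedure() {
    if [ "$#" -ne 5 ]; then
        echo "usage: next_procedure bmc petitboot video os sensor_red" >&2
        return 2
    fi
    # For steps 2-5, anything other than "yes" is treated as "no".
    [ "$1" = yes ] || { echo "Resolving a BMC access problem"; return 0; }
    [ "$2" = yes ] || { echo "Resolving a system firmware boot failure"; return 0; }
    [ "$3" = yes ] || { echo "Resolving a VGA monitor problem"; return 0; }
    [ "$4" = yes ] || { echo "Resolving an operating system boot failure"; return 0; }
    # For step 6, anything other than "no" is treated as a red sensor.
    [ "$5" = no ] || { echo "Resolving a sensor indicator problem"; return 0; }
    # Step 7: all checks passed without finding a more specific procedure.
    echo "Resolving a hardware problem"
}
```

For example, next_procedure yes yes no yes no prints the name of the VGA monitor procedure, because the first failed check determines the result.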
Resolving a BMC access problem
Learn how to identify the service action that is needed to resolve a baseboard management controller
(BMC) access problem.
Procedure
1. Ensure that the BMC password is not set to the default password. For information about changing the
default password, see Logging on to the BMC GUI. Does the problem persist?
   Yes: Continue with the next step.
   No: This ends the procedure.
2. Are both ends of the network cable seated securely?
   Yes: Continue with the next step.
   No: Seat both ends of the cable securely. If the problem persists, continue with the next step.
3. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC
power cords and power on the system. Does the BMC access problem persist?
   Yes: Continue with the next step.
   No: This ends the procedure.
4. Verify that the BMC network settings are correct.
a) Power on the system by using the power button on the front of the system. Wait 1 - 2 minutes for
the system to display the Petitboot menu.
b) When the Petitboot menu is displayed, press any key to interrupt the boot process. Then, select
Exit to Shell.
c) Type the following command and press Enter:
ipmitool lan print 1
d) Verify that the MAC address and the IP address settings are correct. Then, continue with the next
step.
Note: If the IP address setting is incorrect, go to Configuring the firmware IP address
website (http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwenablenetwork.htm).
If the MAC address is 00:00:00:00:00:00, go to “Contacting IBM service and support” on page 61.
5. Are you able to log in to the BMC web interface?
   Yes: To update the BMC firmware, go to Updating the system firmware by using the BMC.
   If the problem persists, go to step “12” on page 4.
   No: Continue with the next step.
6. Complete the following steps:
a. Connect a VGA monitor to the system.
b. Press the power button to power on the system.
c. Boot the system to the Petitboot menu. From the Petitboot menu, select Exit to shell.
7. Are you mounting the storage that contains the pUpdate utility and the BMC firmware file from a
network storage location?
   Yes: Continue with the next step.
   No: Go to step “9” on page 4.
8. To update the BMC firmware by using a network storage location, complete the following steps:
a) Type mkdir /tmp/media and press Enter.
b) Type the following command and press Enter:
mount -t nfs xxx.xxx.xx.xx:/path/of/files /tmp/media, where xxx.xxx.xx.xx is the
IP address of the system to which you want to establish the connection.
c) Type cd /tmp/media and press Enter.
d) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
e) Allow at least 2 minutes for the BMC to reboot. Does the problem persist?
   Yes: Go to step “12” on page 4.
   No: This ends the procedure.
9. Update the BMC firmware by using a USB device. Complete the following steps:
a) Ensure that the USB device is formatted by using the VFAT file system.
b) Insert the USB device into the system if you have not already done so.
c) Type mount and press Enter.
Is the following output displayed?
/dev/mapper/sdb1 mounted on /var/petitboot/mnt/dev/sdb1
   Yes: Continue with the next step.
   No: Go to step “11” on page 4.
10. Complete the following steps:
a) Type cd /var/petitboot/mnt/dev/sdb1 and press Enter.
b) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
c) Allow at least 2 minutes for the BMC to reboot. Does the problem persist?
   Yes: Go to step “12” on page 4.
   No: This ends the procedure.
11. Complete the following steps:
a) Type mkdir /tmp/media and press Enter.
b) Type mount /dev/mapper/sdb1 /tmp/media and press Enter.
c) Type cd /tmp/media and press Enter.
d) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
e) Allow at least 2 minutes for the BMC to reboot. Does the problem persist?
   Yes: Go to step “12” on page 4.
   No: This ends the procedure.
12. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63
to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
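The MAC address check in step 4 can be scripted instead of read by eye. The following is a minimal sketch, assuming that the output of ipmitool lan print 1 contains a line that begins with the label "MAC Address" followed by " : " and the value, as common ipmitool builds print it. The helper names lan_mac and mac_is_unset are hypothetical.

```shell
# Sketch only: lan_mac and mac_is_unset are hypothetical helpers.
# Assumption: the input is `ipmitool lan print 1` output, with a line such as
#   MAC Address             : 0c:c4:7a:00:00:01

# Reads lan print output on stdin and prints the MAC address value.
lan_mac() {
    awk -F' : ' '/^MAC Address/ {
        sub(/[ \t\r]+$/, "", $2)   # drop trailing whitespace, if any
        print $2
        exit
    }'
}

# Reads lan print output on stdin; succeeds (exit status 0) when the
# MAC address is all zeros, the case that step 4 refers to IBM service.
mac_is_unset() {
    [ "$(lan_mac)" = "00:00:00:00:00:00" ]
}
```

From the Petitboot shell, a possible use is: ipmitool lan print 1 | mac_is_unset && echo "MAC address is all zeros; contact IBM service and support". The IP address still needs to be compared with the expected value for your network.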
Resolving a power problem
Learn how to identify the service action that is needed to resolve a power problem.
Procedure
1. Is the identify LED on the front of the system flashing red slowly at 0.25 Hz (one flash every 4 seconds)? For more information
about LEDs, see LEDs on the 9006-12P system or LEDs on the 5104-22C, 9006-22C, or 9006-22P
system.
   Yes: Continue with the next step.
   No: No service action is required. This ends the procedure.
2. Perform the following actions, one at a time until the problem is resolved:
a. Ensure that all of the power cords are fully seated in the power supplies.
b. Ensure that the power supply is fully seated in the system.
c. Ensure that the power supply fan is not blocked.
d. Ensure that all of the power cords are fully seated in the power distribution units (PDUs) or wall
outlets.
e. If the power cords are plugged into PDUs, ensure that the PDUs are turned on.
f. Replace the power cords.
g. Replace the power supplies.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
Resolving a system firmware boot failure
Learn how to identify the service action that is needed to resolve a failure while booting your system
firmware.
Procedure
1. Does the baseboard management controller (BMC) respond to commands and are you able to access
the BMC web interface?
Note: To determine whether the BMC responds to commands, run the following ipmitool command:
ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> chassis status
• Yes: Continue with step “3” on page 6.
• No: Continue with the next step.
2. Complete the following actions, one at a time, until the problem is resolved:
a. Reset the BMC remotely by entering the following command:
ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> mc reset cold
b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5
minutes, and then go to step “1” on page 5.
c. Update the BMC firmware by using the pUpdate command with the block transfer (BT) option:
1) Type mkdir /tmp/media and press Enter.
2) Type the following command and press Enter:
mount -t nfs xxx.xxx.xx.xx:/path/of/files /tmp/media, where xxx.xxx.xx.xx is
the IP address of the system to which you want to establish the connection.
3) Type cd /tmp/media and press Enter.
4) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
5) Allow at least 2 minutes for the BMC to reboot.
d. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
3. After you pressed the power button, did the system turn on but fail to display the Petitboot menu?
• Yes: Continue with the next step.
• No: This ends the procedure.
4. Complete the following actions, one at a time, until the problem is resolved:
a. Ensure that the TPM card is fully seated.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location.
b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5
minutes, and then go to step “3” on page 6.
c. Update the PNOR firmware. For instructions, see Getting fixes.
Note: If your system is a 9006-12P or 9006-22P, the PNOR firmware level must be
V2.12-20190404, or later.
d. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
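The check in step 1 can be scripted. The following is a minimal Python sketch, not part of the IBM procedure: it wraps the ipmitool chassis status command shown above and treats a nonzero exit status, a timeout, or a missing ipmitool binary as "the BMC does not respond". The function names and the 30-second timeout are assumptions made for illustration.

```python
import subprocess


def bmc_chassis_status_cmd(host, user, password):
    """Build the ipmitool command from step 1 as an argument list."""
    return ["ipmitool", "-I", "lanplus", "-U", user, "-P", password,
            "-H", host, "chassis", "status"]


def bmc_responds(host, user, password, timeout=30):
    """Return True if the chassis status command completes successfully.

    Assumption: ipmitool exits with a nonzero status when the BMC
    does not answer. The timeout value is an arbitrary choice.
    """
    try:
        result = subprocess.run(
            bmc_chassis_status_cmd(host, user, password),
            capture_output=True, text=True, timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

If bmc_responds returns False, the procedure continues with step 2. Otherwise, it continues with step 3. Access to the BMC web interface still has to be checked separately.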
Resolving a VGA monitor problem
Learn how to identify the service action that is needed to resolve a video graphics array (VGA) monitor
problem.
Procedure
1. Is the system powered on and is the VGA monitor connected to the VGA display port, but no video is
displayed?
• Yes: Continue with the next step.
• No: This ends the procedure.
2. Complete the following steps, one at a time until the problem is resolved:
a) Ensure that the VGA cable is properly seated to the server port and to the monitor port.
b) Verify that your monitor and your VGA cable are working properly by testing them on a system that
is known to be working properly. If the monitor or the VGA cable does not work properly, replace it.
c) Verify that the system is powered on by activating a serial over LAN (SOL) session through the
baseboard management controller (BMC). If the system is not active, go to “Resolving a system
firmware boot failure” on page 5.
d) Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
Resolving an operating system boot failure
Learn how to identify the service action that is needed to resolve a failure while booting your operating
system.
Procedure
1. Was the system recently installed, serviced, moved, or upgraded?
• Yes: Ensure that all cables are properly seated in the connection path to the designated
boot device. This ends the procedure.
• No: Continue with the next step.
2. Are you booting the operating system from a network location?
• Yes: Continue with the next step.
• No: Continue with step “4” on page 8.
3. Complete the following actions, one at a time until the problem is resolved:
a. Ensure that a problem does not exist with the connection to the network location.
b. Ensure that the adapter has a valid IP address for the network.
c. Replace the network adapter.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
4. Petitboot displays all recognized bootable images to use by default. Is the boot image recognized by
Petitboot?
• Yes: Continue with step “10” on page 9.
• No: Select the Petitboot menu option to refresh the boot images. If the problem persists,
continue with the next step.
5. To determine the command to type on the Petitboot command line to verify that the boot drive is
recognized and in optimal status, use Table 1 on page 8.
Table 1. Determine the command to verify that the boot drive is recognized and in optimal status
Boot drive configuration | Command
Virtual drive connected directly to the system backplane | arcconf getconfig 1 LD
Physical drive connected directly to the system backplane | arcconf getconfig 1 PD
Is the boot drive recognized and in optimal status?
• Yes: Reinstall the operating system on the boot drive. This ends the procedure.
• No: Continue with the next step.
6. Are the drives properly seated in their respective drive bays?
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63
to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
• Yes: Continue with the next step.
• No: Properly seat the drives in the drive bays. Then, go to step “4” on page 8.
7. Refresh the Petitboot boot options. Is the boot image on the boot drive recognized?
• Yes: Boot the operating system. Then, continue with step “10” on page 9.
• No: Continue with the next step.
8. To determine the command to type on the Petitboot command line to verify that the drives that are
known to be in a RAID array are recognized, use Table 2 on page 9.
Table 2. Determine the command to verify that the drives that are known to be in a RAID array are
recognized
Drive configuration | Commands
Drive connected directly to the system backplane | arcconf getconfig 1 LD and arcconf getconfig 1 PD
Are the drives that are known to be in the RAID array recognized?
• Yes: Reinstall the operating system on the boot drive. This ends the procedure.
• No: Continue with the next step.
9. Complete the following actions, one at a time, until the physical drives are recognized in the RAID
array:
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63
to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
a. If the drive is connected directly to the system backplane, ensure that the mini-SAS cable and
SATA cables are properly seated in the disk drive backplane and system backplane.
b. Replace the SAS or SATA cable.
c. If the drive is connected directly to the system backplane, replace the system backplane.
This ends the procedure.
10. Does an operating system error occur during the boot?
• Yes: Recover the operating system with the tools for the operating system. If that does
not resolve the problem, reinstall the operating system. This ends the procedure.
• No: Reinstall the operating system. This ends the procedure.
Resolving a sensor indicator problem
Learn how to resolve a sensor indicator problem.
About this task
To determine whether a service action is required, complete the following procedure:
Note: For more information about sensors, see Sensor readings GUI display.
Procedure
1. If the system is not powered on, boot the system to the operational state. Log in to the BMC web
interface. Then, click Server Health > Sensor Readings.
Are any of the sensor indicator LEDs red?
• Yes: Continue with the next step.
• No: This ends the procedure.
2. Record the names of any sensors that have a red LED indicator status.
Note: Repeat steps 3 - 6 for every sensor that you record in this step.
3. Use one of the following commands to list the system event logs (SELs).
• To list SELs by using an in-band network, enter the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
4. Review the list of SELs and locate the log entry that meets the following criteria:
• The name of any of the sensors you recorded in step 2.
• A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 23.
• Asserted is in the description.
Did you identify a log entry that meets the above criteria?
• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and
support” on page 61. This ends the procedure.
5. Use one of the following options to display the SEL details for the sensor:
Note: You must specify the SEL record ID in hexadecimal format. For example: 0x1a.
• To display SEL details by using an in-band network, enter the following command:
ipmitool sel get <SEL record ID>
• To display SEL details remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
6. The sensor ID field contains sensor information in the sensor name (sensor ID) format. Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
• If your system is a 5104-22C, 9006-12P, 9006-22C, or 9006-22P, go to “Identifying a service action
by using sensor and event information for the 5104-22C, 9006-12P, 9006-22C, or 9006-22P” on
page 24 to determine the service action to perform. This ends the procedure.
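Steps 3 - 5 can be sketched in Python as a filter over the text that ipmitool sel elist prints. This is an illustration only, under stated assumptions: each log line is assumed to be pipe-separated in the order record ID, date, time, sensor, event description, state, with the record ID printed in hexadecimal without the 0x prefix. The service action keywords are listed in “Identifying service action keywords in system event logs”, not here, so the caller supplies them. The function names are hypothetical.

```python
def parse_sel_line(line):
    """Split one 'ipmitool sel elist' line into named fields.

    Assumed layout: record ID | date | time | sensor | event | state.
    Returns None for lines that do not have at least six fields.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 6:
        return None
    return {
        # The record ID is converted to the 0x form that 'sel get' expects.
        "record_id": "0x" + fields[0].lower(),
        "sensor": fields[3],
        "event": fields[4],
        "state": fields[5],
    }


def find_service_entries(sel_text, sensor_names, keywords):
    """Return entries that name a recorded sensor, contain a service
    action keyword, and are in the Asserted state (criteria of step 4)."""
    matches = []
    for line in sel_text.splitlines():
        entry = parse_sel_line(line)
        if entry is None:
            continue
        if not any(name in entry["sensor"] for name in sensor_names):
            continue
        if entry["state"] != "Asserted":
            continue
        if any(key.lower() in entry["event"].lower() for key in keywords):
            matches.append(entry)
    return matches
```

The record_id value of each match is already in the hexadecimal form, such as 0x1a, that step 5 requires for ipmitool sel get.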
Resolving a hardware problem
Learn how to identify the service action that is needed to resolve a hardware problem.
Procedure
1. If you have not already done so, manually boot the system.
2. Go to “Identifying a service action by using system event logs” on page 18. Then, continue with the
next step.
3. Was a service action identified?
• Yes: Continue with the next step.
• No: Go to step “5” on page 11.
4. Did the service action fix the problem?
• Yes: This ends the procedure.
• No: Go to step “5” on page 11.
5. Go to “Resolving a PCIe adapter or device problem” on page 11. Then, continue with the next step.
6. Was a service action identified?
• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service
and support” on page 61. This ends the procedure.
7. Did the service action fix the problem?
• Yes: This ends the procedure.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service
and support” on page 61. This ends the procedure.
Resolving a PCIe adapter or device problem
Learn how to access log files, information to identify types of events, and a list of potential problems and
service actions.
About this task
Procedure
1. To identify the correct service procedure to perform by using operating system log information,
complete the following steps:
a) Log in as the root user.
b) At the command prompt, type dmesg and press Enter.
2. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed.
When you find a keyword that accompanies one or more of the resource names in Table 3 on page
12, a service action is required.
Did you find an operating system log that requires a service action?
• Yes: Use Table 3 on page 12 to determine the service procedure to perform for your type
of problem. This ends the procedure.
• No: Continue with the next step.
Table 3. Resource names, examples, and service procedures for different types of operating system
logs.
Resource name | Example of a log requiring a service action | Type of problem | Service procedure
eth1, eth2, eth3, enPxxxxx, where xxxxx indicates the network port. | Link is Down | Network | Go to “Resolving a network adapter problem” on page 13.
mlx5_core | Link Down; Failed to reinitialize device; health_care: handling bad device here | Network | Go to “Resolving a network adapter problem” on page 13.
tg3 | PCI I/O error detected. | Network | Go to “Resolving a network adapter problem” on page 13.
nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter | Go to “Resolving an NVMe Flash adapter problem” on page 15.
sda, sdb, sdc | FAILED Result | Storage | Go to “Resolving a storage device problem” on page 15.
EEH | Detected error on PHB#xxx, where xxx is the PHB number. | PCIe bus or adapter | Resolve any device driver errors that are related to I/O and that occurred near the time of this operating system log entry.
EEH | xxx has failed 6 times in the last hour and has been permanently disabled, where xxx is the PCI bus number. | PCIe bus or adapter | Ensure that the correct device drivers are properly installed for the device. If the problem persists, replace the adapter in the PCIe slot that is specified in the operating system log entry.
3. Are all of the adapters in the system missing or failed?
• Yes: Perform the following actions, one at a time, until the problem is resolved:
a. Ensure that the PCIe risers are fully seated in the system.
b. Replace system processor CPU 1.
c. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C
locations” on page 63 to identify the physical location and the removal and
replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify
the physical location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify
the physical location and the removal and replacement procedure.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service
and support” on page 61.
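The log scan in step 2 can be sketched as follows. This is a rough Python illustration, not an IBM tool: it looks for the first dmesg line that contains one of the keywords from step 2 together with a resource name from Table 3, and returns the type of problem for that line. The regular expressions are assumptions about how the resource names appear in log lines, and the function name is hypothetical.

```python
import re

# Resource names from Table 3, grouped by type of problem.
# The patterns are illustrative guesses at how the names appear in dmesg.
RESOURCES = [
    ("Network", re.compile(r"\b(?:eth\d+|enP\w+|mlx5_core|tg3)\b")),
    ("NVMe Flash adapter", re.compile(r"\bnvme\b")),
    ("Storage", re.compile(r"\bsd[a-z]\b")),
    ("PCIe bus or adapter", re.compile(r"\bEEH\b")),
]

# Keywords named in step 2; matching is case-insensitive.
KEYWORDS = ("fail", "failure", "failed")


def first_service_log(dmesg_text):
    """Return (type of problem, log line) for the first line that pairs
    a keyword with a Table 3 resource name, or None if there is none."""
    for line in dmesg_text.splitlines():
        lowered = line.lower()
        if not any(keyword in lowered for keyword in KEYWORDS):
            continue
        for problem_type, pattern in RESOURCES:
            if pattern.search(line):
                return problem_type, line
    return None
```

The dmesg output can be captured to a string and passed to first_service_log. A result of None corresponds to the No branch of step 2. Table 3 also lists examples, such as Link is Down, that do not contain these keywords, so the sketch does not replace a manual review of the log.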
Resolving a network adapter problem
Learn about the possible problems and service actions that you can perform to resolve a network adapter
problem.
About this task
Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by
using the slot number” on page 16.
Table 4. Network adapter problems and service actions
Problem: System is unable to find the adapter, or the negotiated PCIe bandwidth of the adapter is less
than expected
Service action:
1. Verify that the adapter is properly seated in a
compatible slot.
2. Install the adapter in a different compatible
slot.
3. Verify that the drivers for the adapter are
installed.
4. Verify that the most recent firmware is installed
on the system, or install the most recent
firmware if it is not already installed.
5. Restart the system.
6. Replace the adapter.
7. If the adapter is connected to a PCIe riser,
replace the PCIe riser.
8. If the adapter is in UIO slot 1, UIO slot 2, or UIO
slot 3, replace CPU 1. Otherwise, replace CPU 2.
9. Replace the system backplane.
Problem: Adapter suddenly stops working
Service action:
1. If the system was recently installed, moved,
serviced, or upgraded, verify that the adapter is
seated properly and all associated cables are
correctly connected.
2. Inspect the PCIe socket and verify that there is
no dirt or debris in the socket.
3. Inspect the card and verify that it is not
physically damaged.
4. Verify that all cables are properly seated and
are not physically damaged. If you recently
added one or more new adapters, remove them
and then test to determine whether the failing
adapter is functioning properly again. If the
network adapter is functioning again, review the
IBM support tips to confirm that there are no
PCI address, driver, or firmware conflicts. Then,
reinstall the new adapters again one at a time
until all adapters function properly.
5. Replace the adapter.
6. If the adapter is connected to a PCIe riser,
replace the PCIe riser.
7. If the adapter is in UIO slot 1, UIO slot 2, or UIO
slot 3, replace CPU 1. Otherwise, replace CPU 2.
8. Replace the system backplane.
Problem: Link indicator light on the adapter is off
Service action:
1. Verify that the cable functions properly by
testing it with a known working connection.
2. Verify that the port or ports on the switch are
enabled and functional.
3. Verify that the switch and adapter are
compatible.
4. Replace the adapter.
Problem: Link light on the adapter is on, but there is no
communication from the adapter
Service action:
1. Verify that the most recent driver is installed, or
install the most recent driver if it is not already
installed.
2. Verify that the adapter and its link have
compatible settings, such as speed and duplex
configuration.
Problem: Other problems
Service action: For information about adapter diagnostics, see
Supporting diagnostics. For information about
adapter user information, see User guides for PCIe
adapters.
Resolving an NVMe Flash adapter problem
Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile
Memory Express (NVMe) Flash adapter problem.
About this task
Note: To determine the location of the NVMe Flash adapter, see “Identifying the location of the NVMe
Flash adapter” on page 17.
Table 5. NVMe Flash adapter problems and service actions
Problem: System is unable to find the NVMe Flash adapter
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the
NVMe Flash adapter is seated and installed properly.
2. Verify that the NVMe Flash adapter is compatible with the system.
3. Verify that the most recent firmware is installed on the system, or install
the most recent firmware if it is not already installed.
4. Replace the NVMe Flash adapter.
Problem: NVMe Flash adapter stops working suddenly
Service action:
1. Check the system logs to verify whether the system detected a problem.
2. Replace the NVMe Flash adapter.
Problem: Other problems
Service action: Check the messages and resolve any other problems that are detected. Then, test
the NVMe Flash adapter again.
Resolving a storage device problem
Learn about the possible problems and service actions that you can perform to resolve a storage device
problem.
About this task
Note: To determine the location of the storage device, see “Identifying the location of the storage device”
on page 17.
Table 6. Storage device problems and service actions
Problem: System is unable to find more than one storage device
Service action:
1. If the system was recently installed, moved,
serviced, or upgraded, verify that the device is
seated and installed properly.
2. Verify that the device is compatible with your
system.
3. Verify that all internal cables are properly
seated and are not physically damaged.
4. Verify that the most recent firmware is installed
on the system, or install the most recent
firmware if it is not already installed.
5. If the devices are part of a RAID configuration,
ensure that the device has been enabled and is
part of an array.
6. Replace the cable that connects the disk drive
backplane to the system backplane.
Problem: System is unable to find a storage device
Service action:
1. If the system was recently installed, moved,
serviced, or upgraded, verify that the device is
seated and installed properly.
2. Verify that the device is compatible with your
system.
3. Verify that all internal cables are properly
seated and are not physically damaged.
4. Verify that the most recent firmware is installed
on the system, or install the most recent
firmware if it is not already installed.
5. If the device is part of a RAID configuration,
ensure that the device has been enabled and is
part of an array.
6. Install the device in an open or free slot. If the
device is then found, replace the
component with the failing connector.
7. Replace the storage device.
8. Replace any applicable attached cable.
Problem: More than one storage device suddenly stops
working
Service action:
1. If the system was recently installed, moved,
serviced, or upgraded, verify that the device is
seated and installed properly.
2. Check the system logs to verify whether the
system detected a problem.
3. Replace the cable that connects the disk drive
backplane to the system backplane.
Problem: One storage device suddenly stops working
Service action:
1. Verify that all internal cables are properly
seated and are not physically damaged.
2. Check the system logs to verify whether the
system detected a problem.
3. Replace the drive.
4. Replace the system backplane.
5. Replace the cable.
Problem: Other problems
Service action: Check the messages and resolve any other
problems that were detected. Then, test the drive
again. If the drive continues not to function, refer
to the documentation for the drive.
Identifying the location of the PCIe adapter by using the slot number
The error message provides information to help you to determine the location of the PCIe adapter.
About this task
For example, the log might contain an error similar to the following text:
Replace the PCIe adapter. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations”
on page 75, or “9006-22P locations” on page 91 and use the slot number information in the operating
system log to identify the physical location and the removal and replacement procedure.
Identifying the location of the NVMe Flash adapter
Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.
Procedure
1. Does the operating system log contain the slot number? For example, the log might contain an error
message similar to the following text:
• Yes: Replace the adapter. Go to “5104-22C or 9006-22C locations” on page 63,
“9006-12P locations” on page 75, or “9006-22P locations” on page 91 and use the
slot number information to identify the physical location and the removal and
replacement procedure. This ends the procedure.
• No: Continue with the next step.
2. Locate the NVMe Flash adapter by using the PCI address:
a) The operating system log contains information about the NVMe Flash adapter in the form of a PCI
address. Record the PCI address information for the NVMe Flash adapter that has failed. For
example, in the operating system log message nvme 0006:01:00.0: Failed status:ffffffff, reset controller, the PCI address of the failing NVMe Flash adapter is
0006:01:00.0.
b) At the command line, type lscfg -vl pciaddress, where pciaddress is the NVMe Flash
adapter information that you recorded in step 2.a. Then, press Enter.
c) Record the slot number information that is in the location code field.
d) Replace the adapter. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations”
on page 75, or “9006-22P locations” on page 91 and use the slot number information to identify
the physical location and the removal and replacement procedure. This ends the procedure.
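Steps 2.a and 2.b can be sketched in Python as follows. This is an illustration under stated assumptions, not an IBM utility: the PCI address is assumed to have the domain:bus:device.function form shown in the example message (0006:01:00.0), and the helper names are hypothetical. The lscfg -vl command is the one named in step 2.b. The sketch only builds its argument list and does not run it.

```python
import re

# PCI address in domain:bus:device.function form, as in 0006:01:00.0.
PCI_ADDRESS = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def nvme_pci_address(log_line):
    """Return the PCI address from an nvme log message, or None."""
    if "nvme" not in log_line:
        return None
    match = PCI_ADDRESS.search(log_line)
    return match.group(0) if match else None


def lscfg_command(pci_address):
    """Build the lscfg command from step 2.b as an argument list."""
    return ["lscfg", "-vl", pci_address]
```

For the message quoted in step 2.a, nvme_pci_address returns 0006:01:00.0, which is then passed to lscfg -vl to read the location code field.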
Identifying the location of the storage device
Use this procedure to identify the location of a storage device.
About this task
The storage device location is determined in the drive removal and replacement procedures for your
system. See Removing a disk drive from the 5104-22C or 9006-22C system, Removing and replacing a
storage drive in the 9006-12P, or Removing and replacing a disk drive in the 9006-22P.
User guides for PCIe adapters
Use this information to find the user guide for your PCIe adapter.
About this task
Use the following table to find the user guide for the PCIe adapter that you are using.
Use the following procedures to help you identify the service action that is needed.
Identifying a service action by using system event logs
Use the Intelligent Platform Management Interface (IPMI) program to examine system event logs (SELs)
to identify a service action.
Procedure
1. Use the ipmitool command to examine SELs.
• To list SELs by using an in-band network, use the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP addres or BMC hostname> sel elist
2. Scan the SELs for an event with the value OEM record de. Did you find a SEL with the value OEM
record de?
• Yes: Continue with the next step.
• No: Go to step “4” on page 20.
3. The OEM record de specific log information is indicated by the rightmost digits of the SEL with the
value OEM record de. Use Table 8 to determine the service action to perform.
Table 8. OEM record de specific log information and service action
OEM record de specific log information | Service action
00xxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
01xxxxxxxxxx | Go to “EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure” on page 49.
04xxxxxxxxxx | Go to “EPUB_PRC_SP_CODE isolation procedure” on page 50.
05xxxxxxxxxx | Go to “EPUB_PRC_PHYP_CODE isolation procedure” on page 50.
08xxxxxxxxxx | Go to “EPUB_PRC_ALL_PROCS isolation procedure” on page 50.
09xxxxxxxxxx | Go to “EPUB_PRC_ALL_MEMCRDS isolation procedure” on page 51.
0Axxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
10xxxxxxxxxx | Go to “EPUB_PRC_LVL_SUPPORT isolation procedure” on page 52.
11xxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
16xxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
1Cxxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
22xxxxxxxxxx | Go to “EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure” on page 52.
2Dxxxxxxxxxx | Go to “EPUB_PRC_FSI_PATH isolation procedure” on page 52.
30xxxxxxxxxx | Go to “EPUB_PRC_PROC_AB_BUS isolation procedure” on page 53.
31xxxxxxxxxx | Go to “EPUB_PRC_PROC_XYZ_BUS isolation procedure” on page 54.
34xxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
37xxxxxxxxxx | Go to “EPUB_PRC_EIBUS_ERROR isolation procedure” on page 55.
3Fxxxxxxxxxx | Go to “EPUB_PRC_POWER_ERROR isolation procedure” on page 56.
4Dxxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
4Fxxxxxxxxxx | Go to “EPUB_PRC_MEMORY_UE isolation procedure” on page 56.
55xxxxxxxxxx | Go to “EPUB_PRC_HB_CODE isolation procedure” on page 57.
56xxxxxxxxxx | Go to “EPUB_PRC_TOD_CLOCK_ERR isolation procedure” on page 58.
5Cxxxxxxxxxx | Go to “EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure” on page 59.
5Dxxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
5Exxxxxxxxxx | Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
This ends the procedure.
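The lookup in step 3 can be sketched as a small Python table. This is an illustration only: the dictionary mirrors Table 8, keyed by the first two characters of the OEM record de specific log information, and the function assumes that this information is the rightmost pipe-separated field of the SEL line, as step 3 describes. The function name and the short action labels are hypothetical. The full service actions and page references remain those in Table 8.

```python
FIRMWARE_UPDATE = "Update the system firmware (see Getting fixes)"

# Leading two characters of the specific log information -> service action.
SERVICE_ACTIONS = {
    "00": FIRMWARE_UPDATE,
    "01": "EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure",
    "04": "EPUB_PRC_SP_CODE isolation procedure",
    "05": "EPUB_PRC_PHYP_CODE isolation procedure",
    "08": "EPUB_PRC_ALL_PROCS isolation procedure",
    "09": "EPUB_PRC_ALL_MEMCRDS isolation procedure",
    "0A": FIRMWARE_UPDATE,
    "10": "EPUB_PRC_LVL_SUPPORT isolation procedure",
    "11": FIRMWARE_UPDATE,
    "16": FIRMWARE_UPDATE,
    "1C": FIRMWARE_UPDATE,
    "22": "EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure",
    "2D": "EPUB_PRC_FSI_PATH isolation procedure",
    "30": "EPUB_PRC_PROC_AB_BUS isolation procedure",
    "31": "EPUB_PRC_PROC_XYZ_BUS isolation procedure",
    "34": FIRMWARE_UPDATE,
    "37": "EPUB_PRC_EIBUS_ERROR isolation procedure",
    "3F": "EPUB_PRC_POWER_ERROR isolation procedure",
    "4D": FIRMWARE_UPDATE,
    "4F": "EPUB_PRC_MEMORY_UE isolation procedure",
    "55": "EPUB_PRC_HB_CODE isolation procedure",
    "56": "EPUB_PRC_TOD_CLOCK_ERR isolation procedure",
    "5C": "EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure",
    "5D": FIRMWARE_UPDATE,
    "5E": FIRMWARE_UPDATE,
}


def oem_de_service_action(sel_line):
    """Return the Table 8 action for an 'OEM record de' SEL line.

    Returns None if the line is not an OEM record de event or if the
    leading two characters are not listed in Table 8.
    """
    if "OEM record de" not in sel_line:
        return None
    # Assumption: the specific log information is the rightmost field.
    info = sel_line.split("|")[-1].strip()
    return SERVICE_ACTIONS.get(info[:2].upper())
```

A return value of None for an OEM record de line means that the code is not covered by Table 8, and the sketch makes no recommendation in that case.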
4. Scan the SELs for an event with the value OEM record df. Did you find a SEL with the value OEM
record df?
• Yes: Continue with the next step.
• No: Go to step “10” on page 21.
5. One or more events might be logged around the same time as the event with the value OEM record
df. These events require a service action if they meet the following criteria:
• A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 23.
• Asserted is in the description.
• OEM record is not in the description.
• The event has a time stamp close to the time stamp of the event with the value OEM record df.
6. Did you find any SEL events that require a service action as defined in step “5” on page 20?
   If      Then
   Yes:    Continue with the next step.
   No:     Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
7. Did you find only one SEL event that requires a service action as defined in step “5” on page 20?
   If      Then
   Yes:    Continue with the next step.
   No:     Go to step “9” on page 21.
8. Record the SEL record ID for the event you identified in step “5” on page 20. The SEL record ID is
indicated by the leftmost digits of the SEL. Use the ipmitool command to display the SEL details.
• To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
• To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the
sensor name, sensor ID, and event description. Then, use the following information to determine the
service action to perform:
• If your system is a 5104-22C, 9006-12P, 9006-22C, or 9006-22P, go to “Identifying a service
action by using sensor and event information for the 5104-22C, 9006-12P, 9006-22C, or
9006-22P” on page 24.
This ends the procedure.
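The two command forms and the sensor ID field format described above can be sketched in Python. This is only an illustrative sketch: the helper names sel_get_command and split_sensor_field are hypothetical, and the regular expression assumes that the field is printed exactly as sensor name (sensor ID) with a hexadecimal ID.

```python
import re


def sel_get_command(record_id, host=None, user=None, password=None):
    """Build the ipmitool argument list that displays one SEL record.

    record_id may be an int (converted to hexadecimal, for example 0x1a)
    or a string that is already in hexadecimal format.
    """
    rec = hex(record_id) if isinstance(record_id, int) else record_id
    args = ["ipmitool"]
    if host is not None:
        # Remote access over the LAN, as in the second command form.
        args += ["-I", "lanplus", "-U", user, "-P", password, "-H", host]
    return args + ["sel", "get", rec]


# Assumed layout of the sensor ID field: "<sensor name> (<0x.. sensor ID>)".
_SENSOR_FIELD = re.compile(r"^(?P<name>.+?)\s*\((?P<id>0x[0-9a-fA-F]+)\)$")


def split_sensor_field(field):
    """Split 'sensor name (sensor ID)' into (name, numeric ID)."""
    match = _SENSOR_FIELD.match(field.strip())
    if match is None:
        raise ValueError(f"unexpected sensor ID field: {field!r}")
    return match.group("name"), int(match.group("id"), 16)
```

The argument list can be passed to subprocess.run on a system where ipmitool is installed. The sensor name and sensor ID that split_sensor_field returns are the values to record for the lookup in Table 10.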
9. You identified more than one event in step “5” on page 20. The service actions for all of the events
that were identified in step “5” on page 20 must be performed to successfully complete the repair.
Record the SEL record IDs for the events that you identified in step “5” on page 20. The SEL record ID
is indicated by the leftmost digits of the SEL. Use the ipmitool command to display SEL details for
each SEL record ID that you recorded.
• To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
• To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
• If your system is a 5104-22C, 9006-12P, 9006-22C, or 9006-22P, go to “Identifying a service
action by using sensor and event information for the 5104-22C, 9006-12P, 9006-22C, or
9006-22P” on page 24.
This ends the procedure.
10. Scan the SEL for an event with the value OEM record c0.
11. Did you find an event with the value OEM record c0?
   If      Then
   Yes:    Continue with the next step.
   No:     Go to step “13” on page 22.
12. The OEM record c0 specific log information is indicated by the rightmost digits of the SEL with the
value OEM record c0. Use Table 9 on page 22 to determine the service action to perform.
Table 9. OEM record c0 specific log information, description, and service action

OEM record c0 specific log information: 2aff6ffxxxxx
Description: A session audit event occurred
Service action: No service action is required.

OEM record c0 specific log information: cdxx6fffffff
Description: An automatic shutdown event occurred due to high system temperature
Service action:
• Search for SEL events that are related to high system temperature and resolve them.
• Ensure that the room temperature meets the requirements that are specified for the system.
• Ensure that there are no air flow obstructions at the front or at the rear of the system.

OEM record c0 specific log information: ceff6fffffff
Description: A machine check event occurred
Service action: Search for serviceable SEL events and resolve them.

OEM record c0 specific log information: cfff6fffffff
Description: An unexpected problem occurred with the voltage regulator output
Service action: If a machine check event is present with a time stamp close to the time stamp of this event, search for serviceable SEL events and resolve them. If a machine check event is not present with a time stamp close to the time stamp of this event, reboot the system to recover from the system hang. If the problem persists, replace the system backplane.
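The lookup in Table 9 can be sketched as a small pattern match. The sketch below assumes that each x in the table stands for any single digit of the specific log information, which is an interpretation rather than something the table states. The names TABLE_9 and describe_c0 are hypothetical.

```python
# Patterns and descriptions copied from Table 9.
# Assumption: "x" matches any single character in that position.
TABLE_9 = [
    ("2aff6ffxxxxx", "A session audit event occurred"),
    ("cdxx6fffffff", "An automatic shutdown event occurred due to high system temperature"),
    ("ceff6fffffff", "A machine check event occurred"),
    ("cfff6fffffff", "An unexpected problem occurred with the voltage regulator output"),
]


def matches(pattern, value):
    """Return True if value has the same length as pattern and agrees with it
    in every position that is not the wildcard 'x'."""
    value = value.lower()
    return len(value) == len(pattern) and all(
        p == "x" or p == v for p, v in zip(pattern, value)
    )


def describe_c0(log_info):
    """Return the Table 9 description for the log information, or None."""
    for pattern, description in TABLE_9:
        if matches(pattern, log_info):
            return description
    return None
```

A value that matches no row returns None. In that case the procedure continues with the next step instead of taking a service action from this table.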
13. One or more SEL events might require a service action. These events require a service action if they
meet the following criteria:
• A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 23.
• Asserted is in the description.
• OEM record is not in the description.
14. Did you find one or more SEL events that require a service action as defined in step “13” on page 22?
   If      Then
   Yes:    Continue with the next step.
   No:     This ends the procedure.
15. The service actions for all of the events that were identified in step “13” on page 22 must be
performed to successfully complete the repair. Record the SEL record IDs for the events that you
identified in step “13” on page 22. The SEL record ID is indicated by the leftmost digits of the SEL.
Use the ipmitool command to display SEL details for each SEL record ID that you recorded.
• To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
• To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
• If your system is a 5104-22C, 9006-12P, 9006-22C, or 9006-22P, go to “Identifying a service
action by using sensor and event information for the 5104-22C, 9006-12P, 9006-22C, or
9006-22P” on page 24.
This ends the procedure.
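The selection criteria used in steps 5 and 13 can be sketched as a filter over SEL events that have already been parsed. The event dictionaries, the function names, and the five-minute window are all assumptions made for illustration. The procedure says only that the time stamps must be close and does not define a window.

```python
from datetime import datetime, timedelta


def needs_service_action(description, keywords):
    """Apply the three description criteria: a service action keyword is
    present, Asserted is present, and OEM record is not present."""
    has_keyword = any(keyword in description for keyword in keywords)
    # The check is case-sensitive, so "Deasserted" does not count as "Asserted".
    return (
        has_keyword
        and "Asserted" in description
        and "OEM record" not in description
    )


def events_near(events, reference_time, keywords, window=timedelta(minutes=5)):
    """Return the events that need a service action and whose time stamp is
    within the (assumed) window around reference_time."""
    return [
        event
        for event in events
        if needs_service_action(event["description"], keywords)
        and abs(event["time"] - reference_time) <= window
    ]
```

For step 13, which has no time stamp criterion, needs_service_action can be used on its own.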
Identifying service action keywords in system event logs
System event logs (SELs) that have Asserted and any of the keywords indicated below in the description
require a service action.
Temperature and voltage service action keywords
• Transition to Critical from Less Severe
• Transition to Critical from Non-recoverable
• Transition to Non-recoverable
• Transition to Non-recoverable from Less Severe
Backplane service action keywords
• State Asserted
Chassis service action keywords
• General Chassis intrusion
Fan service action keywords
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Device Removed / Device Absent
• Transition to degraded
• Install error
• Redundancy lost
• Non-redundant insufficient resources
Memory service action keywords
• Configuration Error
• Transition to Non-recoverable
• Predictive Failure
Processor service action keywords
• IERR
• Transition to Non-recoverable
• Predictive Failure
• Device Disabled
Power supply service action keywords
• Power Supply Failure Detected
• Predictive Failure
• Power Supply Input Lost or AC DC
• Power Supply Input Lost Or Out of Range
• Power Supply Input Out of Range But Present
System event service action keywords
• Undetermined system hardware failure
Watchdog service action keywords
• Hard Reset
• Power Down
• Power Cycle
• Timer Interrupt
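The keyword lists above can be held in a small lookup structure. The following sketch copies the keywords from the lists. The category labels, the name matched_keywords, and the assumption that the event description is a single string that contains both the keyword and the word Asserted are illustrative only.

```python
# Keywords copied from the lists above, grouped by the same headings.
SERVICE_ACTION_KEYWORDS = {
    "temperature and voltage": [
        "Transition to Critical from Less Severe",
        "Transition to Critical from Non-recoverable",
        "Transition to Non-recoverable",
        "Transition to Non-recoverable from Less Severe",
    ],
    "backplane": ["State Asserted"],
    "chassis": ["General Chassis intrusion"],
    "fan": [
        "Transition to Critical from Less Severe",
        "Transition to Non-recoverable from Less Severe",
        "Transition to Critical from Non-recoverable",
        "Device Removed / Device Absent",
        "Transition to degraded",
        "Install error",
        "Redundancy lost",
        "Non-redundant insufficient resources",
    ],
    "memory": ["Configuration Error", "Transition to Non-recoverable", "Predictive Failure"],
    "processor": ["IERR", "Transition to Non-recoverable", "Predictive Failure", "Device Disabled"],
    "power supply": [
        "Power Supply Failure Detected",
        "Predictive Failure",
        "Power Supply Input Lost or AC DC",
        "Power Supply Input Lost Or Out of Range",
        "Power Supply Input Out of Range But Present",
    ],
    "system event": ["Undetermined system hardware failure"],
    "watchdog": ["Hard Reset", "Power Down", "Power Cycle", "Timer Interrupt"],
}


def matched_keywords(category, description):
    """Return the keywords of the category that appear in an Asserted event."""
    if "Asserted" not in description:  # case-sensitive: excludes "Deasserted"
        return []
    return [k for k in SERVICE_ACTION_KEYWORDS[category] if k in description]
```

An empty result means that the description does not meet the keyword criterion for that category.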
Identifying a service action by using sensor and event information
You can use sensor and event information from the system event log (SEL) to determine a service action.
Identifying a service action by using sensor and event information for the 5104-22C, 9006-12P,
9006-22C, or 9006-22P
You can use the sensor and event information from the system event log to determine a service action to
perform for the 5104-22C, 9006-12P, 9006-22C, or 9006-22P.
Procedure
If you have not done so already, complete “Identifying a service action by using system event logs” on
page 18. Then, use the following table to determine the service action to perform.
Table 10. Sensor information, event description, and service action for the 5104-22C, 9006-12P,
9006-22C, or 9006-22P
Sensor name (Sensor ID): System Temp (0x01)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): Peripheral Temp (0x02)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that there are no air flow obstructions at the front or at the rear of the system.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): Backplane Fault (0x03)
Event description: State Deasserted
Service action: No service action is required.
Event description: State Asserted
Service action: Replace the system backplane. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): Unknown Backplane Fault (0x03)
Event description: Transition to Non-recoverable
Service action: Power on the system. If a message is displayed that indicates that the TPM card is missing, reseat the TPM card. If the problem persists, replace the TPM card. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): System Event (0x04)
Event description: Undetermined system hardware failure
Service action: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and support” on page 61.
Event description:
• System Reconfigured
• OEM System boot event
• Entry added to auxiliary log
• PEF Action
• Timestamp Clock Sync
Service action: No service action is required.

Sensor name (Sensor ID): Boot Progress (0x05)
Event description:
• Unknown Error
• Unknown Hang
• Unknown Progress
Service action: No service action is required.

Sensor name (Sensor ID):
• PCIE CPU1 Pwr (0x06)
• PCIE CPU2 Pwr (0x07)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• OCC Active 1 (0x08)
• OCC Active 2 (0x09)
Event description: Device Disabled
Service action: If the sensor name is OCC Active 1, replace CPU 1. If the sensor name is OCC Active 2, replace CPU 2. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
Event description:
• State Deasserted
• Device Enabled
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU1 Temp (0x0B)
• CPU2 Temp (0x0D)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Func 1 (0x0C)
• CPU Func 2 (0x0E)
Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
• Thermal Trip
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Processor Automatically Throttled
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: If the sensor name is CPU Func 1, replace CPU 1. If the sensor name is CPU Func 2, replace CPU 2. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
Event description:
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID):
• P1-DIMMA1 Func (0x10)
• P1-DIMMA2 Func (0x11)
• P1-DIMMB1 Func (0x12)
• P1-DIMMB2 Func (0x13)
• P1-DIMMC1 Func (0x14)
• P1-DIMMC2 Func (0x15)
• P1-DIMMD1 Func (0x16)
• P1-DIMMD2 Func (0x17)
• P2-DIMMA1 Func (0x18)
• P2-DIMMA2 Func (0x19)
• P2-DIMMB1 Func (0x1A)
• P2-DIMMB2 Func (0x1B)
• P2-DIMMC1 Func (0x1C)
• P2-DIMMC2 Func (0x1D)
• P2-DIMMD1 Func (0x1E)
• P2-DIMMD2 Func (0x1F)
Event description:
• Transition to Non-recoverable
• Predictive Failure
• Memory Device Disabled
• Uncorrectable Memory Error
• Memory Scrub Failed
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: If the sensor name is P1-DIMMA1 Func, replace P1-DIMMA1. If the sensor name is P1-DIMMA2 Func, replace P1-DIMMA2. And so on. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
Event description:
• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Memory Automatically Throttled
• Critical Over temperature
• Presence Detected
• Spare
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID):
• P1-DIMMA1 Func (0x10)
• P1-DIMMA2 Func (0x11)
• P1-DIMMB1 Func (0x12)
• P1-DIMMB2 Func (0x13)
• P1-DIMMC1 Func (0x14)
• P1-DIMMC2 Func (0x15)
• P1-DIMMD1 Func (0x16)
• P1-DIMMD2 Func (0x17)
• P2-DIMMA1 Func (0x18)
• P2-DIMMA2 Func (0x19)
• P2-DIMMB1 Func (0x1A)
• P2-DIMMB2 Func (0x1B)
• P2-DIMMC1 Func (0x1C)
• P2-DIMMC2 Func (0x1D)
• P2-DIMMD1 Func (0x1E)
• P2-DIMMD2 Func (0x1F)
Event description: Configuration Error
Service action: Complete the following steps:
a. If the sensor name is P1-DIMMA1 Func, ensure that P1-DIMMA1 is seated properly. If the sensor name is P1-DIMMA2 Func, ensure that P1-DIMMA2 is seated properly. And so on.
b. If you recently installed or replaced memory DIMMs, ensure that the DIMMs are plugged in the correct memory slots.
c. If the sensor name is P1-DIMMA1 Func, replace P1-DIMMA1. If the sensor name is P1-DIMMA2 Func, replace P1-DIMMA2. And so on. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
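The "And so on" pattern for the DIMM Func sensors follows a regular layout: two processors, channels A through D, two slots per channel, with sensor IDs that run consecutively from 0x10 to 0x1F. The sketch below reproduces that layout and derives the DIMM to reseat or replace from the sensor name. The function names are hypothetical and the sketch covers only the Func sensors listed above.

```python
def dimm_func_sensors(base=0x10):
    """Return {sensor ID: sensor name} for the 16 DIMM Func sensors,
    in the order listed in the table (P1 before P2, A1, A2, B1, ...)."""
    sensors = {}
    offset = 0
    for processor in ("P1", "P2"):
        for channel in "ABCD":
            for slot in (1, 2):
                sensors[base + offset] = f"{processor}-DIMM{channel}{slot} Func"
                offset += 1
    return sensors


def dimm_for_sensor(sensor_name):
    """Drop the trailing ' Func' to get the DIMM that the sensor refers to."""
    name, suffix = sensor_name.rsplit(" ", 1)
    if suffix != "Func":
        raise ValueError(f"not a DIMM Func sensor: {sensor_name!r}")
    return name
```

For example, sensor ID 0x1A maps to P2-DIMMB1 Func, so the DIMM to check is P2-DIMMB1.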

Sensor name (Sensor ID):
• P1-DIMMA1 Temp (0x20)
• P1-DIMMA2 Temp (0x21)
• P1-DIMMB1 Temp (0x22)
• P1-DIMMB2 Temp (0x23)
• P1-DIMMC1 Temp (0x24)
• P1-DIMMC2 Temp (0x25)
• P1-DIMMD1 Temp (0x26)
• P1-DIMMD2 Temp (0x27)
• P2-DIMMA1 Temp (0x28)
• P2-DIMMA2 Temp (0x29)
• P2-DIMMB1 Temp (0x2A)
• P2-DIMMB2 Temp (0x2B)
• P2-DIMMC1 Temp (0x2C)
• P2-DIMMC2 Temp (0x2D)
• P2-DIMMD1 Temp (0x2E)
• P2-DIMMD2 Temp (0x2F)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Core Temp 25 (0x30)
• CPU Core Temp 26 (0x31)
• CPU Core Temp 27 (0x32)
• CPU Core Temp 28 (0x33)
• CPU Core Temp 29 (0x34)
• CPU Core Temp 30 (0x35)
• CPU Core Temp 31 (0x36)
• CPU Core Temp 32 (0x37)
• CPU Core Temp 33 (0x38)
• CPU Core Temp 34 (0x39)
• CPU Core Temp 35 (0x3A)
• CPU Core Temp 36 (0x3B)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Core Temp 37 (0x3C)
• CPU Core Temp 38 (0x3D)
• CPU Core Temp 39 (0x3E)
• CPU Core Temp 40 (0x3F)
• CPU Core Temp 41 (0x40)
• CPU Core Temp 42 (0x41)
• CPU Core Temp 43 (0x42)
• CPU Core Temp 44 (0x43)
• CPU Core Temp 45 (0x44)
• CPU Core Temp 46 (0x45)
• CPU Core Temp 47 (0x46)
• CPU Core Temp 48 (0x47)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• Turbo Allowed (0x48)
• TPM Required (0x49)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• SAS Temp (0x4A)
• HDD Temp (0x4B)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Ensure that the ambient temperature is within operating specifications. Ensure that there are no blockages to the air inlet and outlets. If blockages are found, remove them. Ensure that all of the fans are working properly by looking for serviceable events related to fans and resolving them.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): HDD Status (0x4C)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): Total Power (0x4D)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU1 Power (0x4E)
• CPU2 Power (0x4F)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Core Func 25 (0x50)
• CPU Core Func 26 (0x51)
• CPU Core Func 27 (0x52)
• CPU Core Func 28 (0x53)
• CPU Core Func 29 (0x54)
• CPU Core Func 30 (0x55)
• CPU Core Func 31 (0x56)
• CPU Core Func 32 (0x57)
• CPU Core Func 33 (0x58)
• CPU Core Func 34 (0x59)
• CPU Core Func 35 (0x5A)
• CPU Core Func 36 (0x5B)
Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Replace system processor CPU 1. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
Event description:
• Thermal Trip
• Processor Automatically Throttled
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Core Func 37 (0x5C)
• CPU Core Func 38 (0x5D)
• CPU Core Func 39 (0x5E)
• CPU Core Func 40 (0x5F)
• CPU Core Func 41 (0x60)
• CPU Core Func 42 (0x61)
• CPU Core Func 43 (0x62)
• CPU Core Func 44 (0x63)
• CPU Core Func 45 (0x64)
• CPU Core Func 46 (0x65)
• CPU Core Func 47 (0x66)
• CPU Core Func 48 (0x67)
Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Replace system processor CPU 2. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
Event description:
• Thermal Trip
• Processor Automatically Throttled
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID):
• Freq Limit OT 1 (0x68)
• Mem Thrttl OT 1 (0x6A)
• Freq Limit OT 2 (0x6C)
• Mem Thrttl OT 2 (0x6E)
Event description: Performance Met
Service action: If Asserted is in the event description, no service action is required. If Deasserted is in the event description, ensure that the ambient temperature is within operating specifications. Ensure that there are no blockages to the air inlet and outlets. If blockages are found, remove them. Ensure that all of the fans are working properly by looking for serviceable events related to fans and resolving them.
Event description: Performance Lags
Service action: If Deasserted is in the event description, no service action is required. If Asserted is in the event description, ensure that the ambient temperature is within operating specifications. Ensure that there are no blockages to the air inlet and outlets. If blockages are found, remove them. Ensure that all of the fans are working properly by looking for serviceable events related to fans and resolving them.

Sensor name (Sensor ID):
• Freq Limit Pwr 1 (0x69)
• Freq Limit Pwr 2 (0x6D)
Event description: Performance Met
Service action: If Asserted is in the event description, no service action is required. If Deasserted is in the event description, ensure that both power supplies are working properly. Search for serviceable events related to system power and voltage and resolve them. Ensure that all of the fans are working properly by looking for serviceable events related to fans and resolving them.
Event description: Performance Lags
Service action: If Deasserted is in the event description, no service action is required. If Asserted is in the event description, ensure that both power supplies are working properly. Search for serviceable events related to system power and voltage and resolve them. Ensure that all of the fans are working properly by looking for serviceable events related to fans and resolving them.
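The Performance Met and Performance Lags rows mirror each other: a service action applies when Performance Met is deasserted or when Performance Lags is asserted. A minimal sketch of that rule follows. The function and dictionary names are hypothetical, and the check lists paraphrase the service actions for the over-temperature (OT) and power (Pwr) sensors.

```python
# Sensor IDs and names copied from the two table rows.
OVER_TEMPERATURE_SENSORS = {
    0x68: "Freq Limit OT 1",
    0x6A: "Mem Thrttl OT 1",
    0x6C: "Freq Limit OT 2",
    0x6E: "Mem Thrttl OT 2",
}
POWER_SENSORS = {0x69: "Freq Limit Pwr 1", 0x6D: "Freq Limit Pwr 2"}


def action_needed(event, asserted):
    """Return True if the event and its assertion state call for a service action."""
    if event == "Performance Met":
        return not asserted
    if event == "Performance Lags":
        return asserted
    raise ValueError(f"unexpected event description: {event!r}")


def checks_for(sensor_id, event, asserted):
    """Return the list of checks to perform, or an empty list if none apply."""
    if not action_needed(event, asserted):
        return []
    if sensor_id in OVER_TEMPERATURE_SENSORS:
        return ["ambient temperature", "air inlet and outlet blockages", "fan events"]
    if sensor_id in POWER_SENSORS:
        return ["power supplies", "power and voltage events", "fan events"]
    raise ValueError(f"unexpected sensor ID: {sensor_id:#04x}")
```

An empty list corresponds to "no service action is required" in the table.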

Sensor name (Sensor ID): VBAT (0x9C)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Replace the time-of-day battery. Go to “5104-22C or 9006-22C locations” on page 63, “9006-12P locations” on page 75, or “9006-22P locations” on page 91 to identify the physical location and removal and replacement procedure.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.
Table 10. Sensor information, event description, and service action for the 5104-22C, 9006-12P,
9006-22C, or 9006-22P (continued)
Sensor name (Sensor ID)Event descriptionService action
• GPU1 Temp (0xA0)
• GPU2 Temp (0xA1)
• Transition to Critical from Less
Severe
• Transition to Non-recoverable
from Less Severe
• Transition to Critical from Nonrecoverable
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going
low
• Lower Non-recoverable – going
high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical - going low
• Upper Critical - going high
• Upper Non-recoverable – going
low
• Upper Non-recoverable – going
high
• Ensure that there are no air
flow obstructions at the front or
at the rear of the system.
• Ensure that the fans are
operating properly.
No service action required.
• CPU Core Temp 1 (0xB0)
• CPU Core Temp 2 (0xB1)
• CPU Core Temp 3 (0xB2)
• CPU Core Temp 4 (0xB3)
• CPU Core Temp 5 (0xB4)
• CPU Core Temp 6 (0xB5)
• CPU Core Temp 7 (0xB6)
• CPU Core Temp 8 (0xB7)
• CPU Core Temp 9 (0xB8)
• CPU Core Temp 10 (0xB9)
• CPU Core Temp 11 (0xBA)
• CPU Core Temp 12 (0xBB)
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going
low
• Lower Non-recoverable – going
high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical - going low
• Upper Critical - going high
• Upper Non-recoverable – going
low
• Upper Non-recoverable – going
high
No service action is required.
• CPU Core Temp 13 (0xBC)
• CPU Core Temp 14 (0xBD)
• CPU Core Temp 15 (0xBE)
• CPU Core Temp 16 (0xBF)
• CPU Core Temp 17 (0xC0)
• CPU Core Temp 18 (0xC1)
• CPU Core Temp 19 (0xC2)
• CPU Core Temp 20 (0xC3)
• CPU Core Temp 21 (0xC4)
• CPU Core Temp 22 (0xC5)
• CPU Core Temp 23 (0xC6)
• CPU Core Temp 24 (0xC7)
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going
low
• Lower Non-recoverable – going
high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical - going low
• Upper Critical - going high
• Upper Non-recoverable – going
low
• Upper Non-recoverable – going
high
No service action is required.
• CPU Core Func 1 (0xC8)
• CPU Core Func 2 (0xC9)
• CPU Core Func 3 (0xCA)
• CPU Core Func 4 (0xCB)
• CPU Core Func 5 (0xCC)
• CPU Core Func 6 (0xCD)
• CPU Core Func 7 (0xCE)
• CPU Core Func 8 (0xCF)
• CPU Core Func 9 (0xD0)
• CPU Core Func 10 (0xD1)
• CPU Core Func 11 (0xD2)
• CPU Core Func 12 (0xD3)
• IERR
• Transition to Non-recoverable
• Predictive Failure
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup
Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU
Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
• Correctable Machine Check
Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less
Severe
• Transition to Non-recoverable
from Less Severe
• Transition to Critical from Non-recoverable
• Thermal Trip
• Processor Automatically
Throttled
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from
OK
• Transition to Non-Critical from
More Severe
• Monitor
• Informational
Replace system processor CPU 1.
Go to “5104-22C or 9006-22C
locations” on page 63,
“9006-12P locations” on page
75, or “9006-22P locations” on
page 91 to identify the physical
location and removal and
replacement procedure.
No service action is required.
• CPU Core Func 13 (0xD4)
• CPU Core Func 14 (0xD5)
• CPU Core Func 15 (0xD6)
• CPU Core Func 16 (0xD7)
• CPU Core Func 17 (0xD8)
• CPU Core Func 18 (0xD9)
• CPU Core Func 19 (0xDA)
• CPU Core Func 20 (0xDB)
• CPU Core Func 21 (0xDC)
• CPU Core Func 22 (0xDD)
• CPU Core Func 23 (0xDE)
• CPU Core Func 24 (0xDF)
• IERR
• Transition to Non-recoverable
• Predictive Failure
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup
Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU
Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
• Correctable Machine Check
Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less
Severe
• Transition to Non-recoverable
from Less Severe
• Transition to Critical from Non-recoverable
• Thermal Trip
• Processor Automatically
Throttled
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from
OK
• Transition to Non-Critical from
More Severe
• Monitor
• Informational
Replace system processor CPU 2.
Go to “5104-22C or 9006-22C
locations” on page 63,
“9006-12P locations” on page
75, or “9006-22P locations” on
page 91 to identify the physical
location and removal and
replacement procedure.
No service action is required.
MB_10G Temp (0xE0)
• Transition to Critical from Less
Severe
• Transition to Non-recoverable
from Less Severe
• Transition to Critical from Non-recoverable
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going
low
• Lower Non-recoverable – going
high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical - going low
• Upper Critical - going high
• Upper Non-recoverable – going
low
• Upper Non-recoverable – going
high
Ensure that there are no air flow
obstructions at the front or at the
rear of the system. Ensure that
the fans are operating properly.
No service action is required.
NVMe_SSD Temp (0xE1)
• Transition to Critical from Less
Severe
• Transition to Non-recoverable
from Less Severe
• Transition to Critical from Non-recoverable
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going
low
• Lower Non-recoverable – going
high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical - going low
• Upper Critical - going high
• Upper Non-recoverable – going
low
• Upper Non-recoverable – going
high
Ensure that there are no air flow
obstructions at the front or at the
rear of the system. Ensure that
the fans are operating properly.
No service action is required.
Chassis Intru (0xE2)
General Chassis intrusion
Ensure that the top cover is
properly installed on the system.
See Installing the service access
cover on a 5104-22C,
9006-22C, or 9006-22P system
or Installing the service access
cover on a 9006-12P system.
• Drive Bay intrusion
• I/O Card area intrusion
• Processor area intrusion
• System unplugged from LAN
• Unauthorized dock
• FAN area intrusion
No service action is required.
• FAN1 (0xE3)
• FAN2 (0xE4)
• FAN3 (0xE5)
• FAN4 (0xE6)
• FAN5 (0xE7)
• FAN6 (0xE8)
• FAN7 (0xE9)
• FAN8 (0xEA)
• Transition to Critical from Less
Severe
• Transition to Non-recoverable
from Less Severe
• Transition to Critical from Non-recoverable
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going
low
• Lower Non-recoverable – going
high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical - going low
• Upper Critical - going high
• Upper Non-recoverable – going
low
• Upper Non-recoverable – going
high
• Device Inserted/Device Present
If the sensor name is FAN1,
FAN4, FAN5, or FAN8, no service
action is required. If the sensor
name is FAN2, replace Fan 2. If
the sensor name is FAN3, replace
Fan 3. If the sensor name is
FAN6, replace Fan 6. If the
sensor name is FAN7, replace
Fan 7. Go to “5104-22C or
9006-22C locations” on page
63, “9006-12P locations” on
page 75, or “9006-22P
locations” on page 91 to
identify the physical location and
removal and replacement
procedure.
No service action is required.
• Device Removed/Device
Absent
• Transition to degraded
• Install error
• Redundancy lost
• Non-redundant insufficient
resources
Ensure that all fans are seated
securely. Go to “5104-22C or
9006-22C locations” on page
63, “9006-12P locations” on
page 75, or “9006-22P
locations” on page 91 to
identify the physical location and
removal and replacement
procedure.
• PS1 Status (0xF3)
• PS2 Status (0xF4)
• Predictive Failure
• Power Supply Input Out of
Range But Present
If the sensor name is PS1 Status,
replace PSU 1. If the sensor
name is PS2 Status, replace PSU
2. Go to “5104-22C or 9006-22C
locations” on page 63,
“9006-12P locations” on page
75, or “9006-22P locations” on
page 91 to identify the physical
location and removal and
replacement procedure.
Power Supply Failure Detected
An assert event immediately
followed by a deassert event
indicates that a power cycle of
the system occurred. No service
action is required. If there is no
deassert event immediately
following the assert event,
replace the power supply. If the
sensor name is PS1 Status,
replace PSU 1. If the sensor
name is PS2 Status, replace PSU
2. Go to “5104-22C or 9006-22C
locations” on page 63,
“9006-12P locations” on page
75, or “9006-22P locations” on
page 91 to identify the physical
location and removal and
replacement procedure.
• Power Supply Input Lost or AC
DC
• Power Supply Input Lost Or Out
Of Range
• State Deasserted
• State Asserted
• Presence Detected
• Ensure that AC power is
supplied to the rack.
• Ensure that the system power
cords are plugged tightly into
both the power supply and the
rack power distribution unit
(PDU) for both system power
supplies.
No service action is required.
Watchdog (0xFF)
• Timer Expired
• Reserved1
• Reserved2
• Reserved3
• Reserved4
• Hard Reset
• Power Down
• Power Cycle
• Timer Interrupt
Search for serviceable SEL events
that have a time stamp close to
the time stamp of this SEL event.
If you found a serviceable SEL
event, perform the service action
that is indicated in this table for
the SEL event. If you cannot boot
the system to the Petitboot
menu, go to “Resolving a system
firmware boot failure” on page 5.
Isolation procedures
Use this information to isolate problems that might occur with your system.
Procedure
1. Use the ipmitool command to examine system event logs (SELs).
• To list SELs by using an in-band network, use the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
2. Identify all SELs with the value OEM record df and Correctable Machine Check Error or Transition
to Non-recoverable in the description. Did you find one or more SELs with the value OEM record df
and Correctable Machine Check Error or Transition to Non-recoverable in the description?
If      Then
Yes:    Continue with the next step.
No:     Go to “Contacting IBM service and support” on page 61. This ends the procedure.
3. For each of the SELs that you identified in step 2, determine the sensor name that is
associated with each SEL. Replace the following items, one at a time, until the problem is resolved:
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
• If the sensor name is CPU Func 1 or CPU Core Func x, where x is 1 - 12, replace system processor
CPU 1.
• If the sensor name is CPU Func 2 or CPU Core Func x, where x is 13 - 24, replace system processor
CPU 2.
Does the problem persist?
If      Then
Yes:    Replace the system backplane. If the replacement of the system backplane does not
        resolve the problem, go to “Contacting IBM service and support” on page 61. This
        ends the procedure.
No:     This ends the procedure.
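The two ipmitool invocations in step 1 differ only in the connection options. The following Python sketch builds either form as an argument list. The function name and its parameters are illustrative and are not part of ipmitool; the command-line options are the ones shown in the procedure.

```python
import shlex


def sel_elist_command(host=None, username=None, password=None):
    """Return the argument list for listing SELs with ipmitool.

    With no host, the in-band form is returned. With a host, the remote
    form over the LAN (lanplus interface) is returned.
    """
    if host is None:
        return ["ipmitool", "sel", "elist"]
    if username is None or password is None:
        raise ValueError("remote access needs a username and a password")
    return ["ipmitool", "-I", "lanplus", "-U", username,
            "-P", password, "-H", host, "sel", "elist"]


if __name__ == "__main__":
    # Placeholder values; substitute the real BMC address and credentials.
    print(shlex.join(sel_elist_command()))
    print(shlex.join(sel_elist_command("bmc.example.com", "admin", "secret")))
```

The returned list can be passed to subprocess.run(..., capture_output=True, text=True) to collect the SEL listing as text. Building a list rather than a single string avoids shell quoting problems with passwords that contain special characters.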
EPUB_PRC_SP_CODE isolation procedure
A problem was detected in the system firmware.
About this task
Update the system firmware image. Go to Getting fixes and update the system firmware with the most
recent level of firmware. Then, reboot the system. If the system firmware update does not resolve the
problem, go to “Contacting IBM service and support” on page 61. This ends the procedure.
EPUB_PRC_PHYP_CODE isolation procedure
A problem was detected in the system firmware.
About this task
Update the system firmware image. Go to Getting fixes and update the system firmware with the most
recent level of firmware. Then, reboot the system. If the system firmware update does not resolve the
problem, go to “Contacting IBM service and support” on page 61. This ends the procedure.
EPUB_PRC_ALL_PROCS isolation procedure
A problem was detected with a system processor.
About this task
Use the following table to determine the service action:
Table 11. EPUB_PRC_ALL_PROCS service actions
System    Service action
5104-22C or 9006-22C
Replace the following items, one at a time, in the
order that is shown until the problem is resolved:
1. System processor CPU 1
2. System processor CPU 2
3. System backplane
Go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and removal
and replacement procedure. If the replacement of
the system processors and the system backplane
does not resolve the problem, go to “Contacting
IBM service and support” on page 61. This ends
the procedure.
9006-12P
Replace the following items, one at a time, in the
order that is shown until the problem is resolved:
1. System processor CPU 1
2. System processor CPU 2
3. System backplane
Go to “9006-12P locations” on page 75 to
identify the physical location and removal and
replacement procedure. If the replacement of the
system processors and the system backplane does
not resolve the problem, go to “Contacting IBM
service and support” on page 61. This ends the
procedure.
9006-22P
Replace the following items, one at a time, in the
order that is shown until the problem is resolved:
1. System processor CPU 1
2. System processor CPU 2
3. System backplane
Go to “9006-22P locations” on page 91 to
identify the physical location and removal and
replacement procedure. If the replacement of the
system processors and the system backplane does
not resolve the problem, go to “Contacting IBM
service and support” on page 61. This ends the
procedure.
EPUB_PRC_ALL_MEMCRDS isolation procedure
A problem was detected with a memory DIMM, but it cannot be isolated to a specific memory DIMM.
Procedure
1. Use the ipmitool command to examine system event logs (SELs).
• To list SELs by using an in-band network, use the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
2. Identify all SELs with the value OEM record df and Transition to Non-recoverable in the
description. Did you find one or more SELs with the value OEM record df and Transition to
Non-recoverable in the description?
If      Then
Yes:    Continue with the next step.
No:     Go to “Contacting IBM service and support” on page 61. This ends the procedure.
3. For each of the SELs that you identified in step 2, determine the sensor name that is
associated with each SEL. Replace the following items, one at a time, until the problem is resolved:
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
• If the sensor name is Membuf Func x, replace the system backplane.
• If the sensor name is P1-DIMMA1 Func, replace P1-DIMMA1. If the sensor name is P1-DIMMA2
Func, replace P1-DIMMA2. Apply the same pattern to the other DIMM sensor names.
Does the problem persist?
If      Then
Yes:    If you have not already done so, replace the system backplane. If the replacement of
        the system backplane does not resolve the problem, go to “Contacting IBM service
        and support” on page 61. This ends the procedure.
No:     This ends the procedure.
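The sensor-name rule in step 3 can be written as a small lookup. The following Python sketch is an illustration only. It assumes that DIMM sensor names follow the pattern of the two examples in the procedure (a location code such as P1-DIMMA1 followed by Func) and that the x in Membuf Func x is a number. The function name is hypothetical.

```python
import re

# Assumed patterns, inferred from the examples in the procedure text.
_MEMBUF = re.compile(r"^Membuf Func \d+$")
_DIMM = re.compile(r"^(P\d+-DIMM[A-Z]\d+) Func$")


def memory_part_to_replace(sensor_name):
    """Return the part to replace for a memory sensor name, or None."""
    name = sensor_name.strip()
    if _MEMBUF.match(name):
        # Memory buffer sensors point at the system backplane.
        return "system backplane"
    match = _DIMM.match(name)
    if match:
        # The DIMM location code is the sensor name without " Func".
        return match.group(1)
    return None
```

A return value of None means that the sensor name is not covered by this step, so the name should be checked against the sensor table instead.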
EPUB_PRC_LVL_SUPPORT isolation procedure
Contact your next level of support for assistance.
About this task
Go to “Contacting IBM service and support” on page 61.
Memory DIMMs are plugged in a configuration that is not valid.
EPUB_PRC_FSI_PATH isolation procedure
The system detected an error with the FSI path.
About this task
Use the following table to determine the service action:
Table 12. EPUB_PRC_FSI_PATH service actions
System    Service action
5104-22C or 9006-22C
Replace the following items, one at a time, in the
order that is shown until the problem is resolved:
1. System processor CPU 1
2. System processor CPU 2
3. System backplane
Go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and removal
and replacement procedure. If the replacement of
the system processors and the system backplane
does not resolve the problem, go to “Contacting
IBM service and support” on page 61. This ends
the procedure.
9006-12P
Replace the following items, one at a time, in the
order that is shown until the problem is resolved:
1. System processor CPU 1
2. System processor CPU 2
3. System backplane
Go to “9006-12P locations” on page 75 to
identify the physical location and removal and
replacement procedure. If the replacement of the
system processors and the system backplane does
not resolve the problem, go to “Contacting IBM
service and support” on page 61. This ends the
procedure.
9006-22P
Replace the following items, one at a time, in the
order that is shown until the problem is resolved:
1. System processor CPU 1
2. System processor CPU 2
3. System backplane
Go to “9006-22P locations” on page 91 to
identify the physical location and removal and
replacement procedure. If the replacement of the
system processors and the system backplane does
not resolve the problem, go to “Contacting IBM
service and support” on page 61. This ends the
procedure.
EPUB_PRC_PROC_AB_BUS isolation procedure
A diagnostic function detected an external processor interface problem.
About this task
Use the following table to determine the service action:
Table 13. EPUB_PRC_PROC_AB_BUS service actions
System    Service action
5104-22C or 9006-22C
Replace the system backplane. If replacing the
system backplane does not resolve the problem,
replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. Go to
“5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and
replacement procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
9006-12P
Replace the system backplane. If replacing the
system backplane does not resolve the problem,
replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. Go to
“9006-12P locations” on page 75 to identify the
physical location and removal and replacement
procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
9006-22P
Replace the system backplane. If replacing the
system backplane does not resolve the problem,
replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. Go to
“9006-22P locations” on page 91 to identify the
physical location and removal and replacement
procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
EPUB_PRC_PROC_XYZ_BUS isolation procedure
A diagnostic function detected an internal processor interface problem.
About this task
Use the following table to determine the service action:
Table 14. EPUB_PRC_PROC_XYZ_BUS service actions
System    Service action
5104-22C or 9006-22C
Replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. If
replacing both system processors does not resolve
the problem, replace the system backplane. Go to
“5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and
replacement procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
9006-12P
Replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. If
replacing both system processors does not resolve
the problem, replace the system backplane. Go to
“9006-12P locations” on page 75 to identify the
physical location and removal and replacement
procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
9006-22P
Replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. If
replacing both system processors does not resolve
the problem, replace the system backplane. Go to
“9006-22P locations” on page 91 to identify the
physical location and removal and replacement
procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
EPUB_PRC_EIBUS_ERROR isolation procedure
A bus error occurred.
Procedure
1. Use the ipmitool command to examine system event logs (SELs).
• To list SELs by using an in-band network, use the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
2. Identify all SELs with the value OEM record df and Correctable Machine Check Error or Transition
to Non-recoverable in the description. Did you find one or more SELs with the value OEM record df
and Correctable Machine Check Error or Transition to Non-recoverable in the description?
If      Then
Yes:    Continue with the next step.
No:     Go to “Contacting IBM service and support” on page 61. This ends the procedure.
3. For each of the SELs that you identified in step 2, determine the sensor name that is
associated with each SEL. Replace the following items, one at a time, until the problem is resolved:
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
• If the sensor name is CPU Func 1 or CPU Core Func x, where x is 1 - 12, replace system processor
CPU 1.
• If the sensor name is CPU Func 2 or CPU Core Func x, where x is 13 - 24, replace system processor
CPU 2.
Does the problem persist?
If      Then
Yes:    Replace the system backplane. If the replacement of the system backplane does not
        resolve the problem, go to “Contacting IBM service and support” on page 61. This
        ends the procedure.
No:     This ends the procedure.
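The rule in step 3 that maps a processor sensor name to a processor can be sketched as follows. The function name is hypothetical; the core ranges (1 - 12 for CPU 1, 13 - 24 for CPU 2) are taken from the step text.

```python
import re

_CPU_FUNC = re.compile(r"^CPU Func ([12])$")
_CORE_FUNC = re.compile(r"^CPU Core Func (\d+)$")


def processor_to_replace(sensor_name):
    """Return 'CPU 1' or 'CPU 2' for a processor sensor name, else None."""
    name = sensor_name.strip()
    match = _CPU_FUNC.match(name)
    if match:
        return f"CPU {match.group(1)}"
    match = _CORE_FUNC.match(name)
    if match:
        core = int(match.group(1))
        if 1 <= core <= 12:
            return "CPU 1"   # cores 1 - 12 belong to system processor CPU 1
        if 13 <= core <= 24:
            return "CPU 2"   # cores 13 - 24 belong to system processor CPU 2
    return None
```

Names that fall outside both patterns return None rather than a guess, so an unexpected sensor name does not lead to replacing the wrong processor.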
EPUB_PRC_POWER_ERROR isolation procedure
A power problem occurred.
About this task
Perform the service action that is indicated for any system event logs that are related to power and
occurred prior to the problem that you are working on. Go to “Identifying a service action by using system
event logs” on page 18. This ends the procedure.
EPUB_PRC_MEMORY_UE isolation procedure
An uncorrectable memory problem occurred.
Procedure
1. Look for system event logs that are related to memory and occurred around the same time as the
problem that you are working on. Go to “Identifying a service action by using system event logs” on
page 18. Did you find any system event logs that are related to memory?
If      Then
Yes:    Perform the service actions that are indicated for the system event logs that are
        related to memory. This ends the procedure.
No:     Continue with the next step.
2. Use the following table to determine the service action:
Table 15. EPUB_PRC_MEMORY_UE service actions
System    Service action
5104-22C or 9006-22C
Replace system processor CPU 1. If replacing the
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2.
Go to “5104-22C or 9006-22C locations” on
page 63 to identify the physical location and
removal and replacement procedure. This ends
the procedure.
9006-12P
Replace system processor CPU 1. If replacing the
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2.
Go to “9006-12P locations” on page 75 to
identify the physical location and removal and
replacement procedure. This ends the
procedure.
9006-22P
Replace system processor CPU 1. If replacing the
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2.
Go to “9006-22P locations” on page 91 to
identify the physical location and removal and
replacement procedure. This ends the
procedure.
EPUB_PRC_HB_CODE isolation procedure
The service processor detected a problem during the early boot process.
Procedure
1. Update the system firmware image. Go to Getting fixes and update the system firmware with the most
recent level of firmware. Then, reboot the system. Does the problem persist?
If      Then
Yes:    Continue with the next step.
No:     This ends the procedure.
2. Use the ipmitool command to examine system event logs (SELs).
• To list SELs by using an in-band network, use the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
3. Identify all SELs with the value OEM record df and Correctable Machine Check Error or Transition
to Non-recoverable in the description. Did you find one or more SELs with the value OEM record df
and Correctable Machine Check Error or Transition to Non-recoverable in the description?
If      Then
Yes:    Continue with the next step.
No:     Go to “Contacting IBM service and support” on page 61. This ends the procedure.
4. For each of the SELs that you identified in step 3, determine the sensor name that is
associated with each SEL. Replace the following items, one at a time, until the problem is resolved:
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
• If the sensor name is CPU Func 1 or CPU Core Func x, where x is 1 - 12, replace system processor
CPU 1.
• If the sensor name is CPU Func 2 or CPU Core Func x, where x is 13 - 24, replace system processor
CPU 2.
Does the problem persist?
If      Then
Yes:    Replace the system backplane. If the replacement of the system backplane does not
        resolve the problem, go to “Contacting IBM service and support” on page 61. This
        ends the procedure.
No:     This ends the procedure.
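Scanning a long SEL listing by eye for the strings named in step 3 is error prone. The following Python sketch filters the text output of ipmitool sel elist for those strings. It makes no assumption about the column layout of the listing and matches by substring only. Whether the marker strings must appear together in one entry is not stated in the procedure, so the sketch keeps any line that contains at least one of them. The function name is illustrative.

```python
# Marker strings taken from the step text.
MARKERS = (
    "OEM record df",
    "Correctable Machine Check Error",
    "Transition to Non-recoverable",
)


def select_sels(lines, markers=MARKERS):
    """Return the SEL lines that contain at least one marker string."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line for marker in markers)]
```

Because matching is by substring, an event such as Transition to Non-recoverable from Less Severe is also returned. Review each returned line against the sensor table before you replace a part.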
EPUB_PRC_TOD_CLOCK_ERR isolation procedure
A diagnostic function detected a problem with the time of day or clock function.
About this task
Use the following table to determine the service action:
Table 16. EPUB_PRC_TOD_CLOCK_ERR service actions
System    Service action
5104-22C or 9006-22C
Replace the system backplane. If replacing the
system backplane does not resolve the problem,
replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. Go to
“5104-22C or 9006-22C locations” on page 63 to
identify the physical location and removal and
replacement procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
9006-12P
Replace the system backplane. If replacing the
system backplane does not resolve the problem,
replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. Go to
“9006-12P locations” on page 75 to identify the
physical location and removal and replacement
procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
9006-22P
Replace the system backplane. If replacing the
system backplane does not resolve the problem,
replace system processor CPU 1. If replacing
system processor CPU 1 does not resolve the
problem, replace system processor CPU 2. Go to
“9006-22P locations” on page 91 to identify the
physical location and removal and replacement
procedure.
If replacing the system backplane and both system
processors does not resolve the problem, go to
“Contacting IBM service and support” on page
61. This ends the procedure.
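The service action tables for the isolation procedures differ mainly in the order in which parts are replaced, and the order is the same for all four system models. The following Python sketch records those orders as data, as read from Tables 11 through 16. The names REPLACEMENT_ORDER and next_part are hypothetical.

```python
CPU1 = "System processor CPU 1"
CPU2 = "System processor CPU 2"
BACKPLANE = "System backplane"

# Replacement order for each isolation procedure, from Tables 11 - 16.
REPLACEMENT_ORDER = {
    "EPUB_PRC_ALL_PROCS": (CPU1, CPU2, BACKPLANE),
    "EPUB_PRC_FSI_PATH": (CPU1, CPU2, BACKPLANE),
    "EPUB_PRC_PROC_AB_BUS": (BACKPLANE, CPU1, CPU2),
    "EPUB_PRC_PROC_XYZ_BUS": (CPU1, CPU2, BACKPLANE),
    "EPUB_PRC_MEMORY_UE": (CPU1, CPU2),
    "EPUB_PRC_TOD_CLOCK_ERR": (BACKPLANE, CPU1, CPU2),
}


def next_part(procedure, already_replaced):
    """Return the next part to replace, or None when the list is exhausted."""
    for part in REPLACEMENT_ORDER[procedure]:
        if part not in already_replaced:
            return part
    return None
```

For example, next_part("EPUB_PRC_TOD_CLOCK_ERR", []) returns the system backplane, which is the first replacement for that procedure. A return value of None corresponds to the point where most of the tables direct you to contact IBM service and support.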
EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure
One or more processor sensors detected an over temperature condition.
About this task
To resolve the over temperature condition, complete the following steps:
Procedure
1. Is the room temperature less than 35°C (95°F)?
If      Then
No:     Bring the room temperature to within the allowable operating range. This ends the
        procedure.
Yes:    Continue with the next step.
2. Are the system front and rear doors free of obstructions?
If      Then
No:     The system must be free of obstructions for proper air flow. Remove any obstructions.
        This ends the procedure.
Yes:    Continue with the next step.
3. Perform the service action that is indicated for any system event logs that are related to fans and
occurred prior to the problem that you are working on. Go to “Identifying a service action by using
system event logs” on page 18. This ends the procedure.
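The decision flow of this isolation procedure can be summarized in a short sketch. The function name, its parameters, and the returned strings are illustrative assumptions and are not part of any IBM tool; only the 35°C threshold and the order of the checks come from the steps above.

```python
ROOM_TEMP_LIMIT_C = 35.0  # threshold from step 1 of the procedure


def cooling_next_action(room_temp_c: float, doors_unobstructed: bool) -> str:
    """Return the next action for an EPUB_PRC_COOLING_SYSTEM_ERR event.

    Hypothetical helper that mirrors steps 1 to 3; it reads no sensors.
    """
    # Step 1: the room temperature must be less than 35 degrees C.
    if not room_temp_c < ROOM_TEMP_LIMIT_C:
        return "Bring the room temperature to within the allowable operating range."
    # Step 2: the front and rear doors must be free of obstructions.
    if not doors_unobstructed:
        return "Remove any obstructions from the system front and rear doors."
    # Step 3: fall through to earlier fan-related system event logs.
    return "Perform the service action for earlier fan-related system event logs."


if __name__ == "__main__":
    print(cooling_next_action(36.0, True))
```

The checks are ordered exactly as in the procedure, so a room that is too warm is reported before air flow is considered.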
Verifying a repair
Learn how to verify hardware operation after you make repairs to the system.
Procedure
1. Power on the system.
2. Did you replace a PCIe adapter or device?
If | Then
Yes: | Go to step “5” on page 60.
No: | Continue with the next step.
3. Scan the system event logs (SELs) for serviceable events that occurred after system hardware was
replaced. For information about SELs that require a service action, see “Identifying a service action by
using system event logs” on page 18.
4. Did any serviceable SEL events occur after hardware was replaced?
If | Then
Yes: | The problem is not resolved. Go to “Identifying a service action by using system event logs” on page 18 and complete the service actions indicated. This ends the procedure.
No: | The problem is resolved. This ends the procedure.
5. Use the following table to determine the verification action to complete:
Table 17. Determining a verification action for PCIe adapters and devices
Adapter type | Verification action
Devices that are not controlled by a RAID adapter | If the device is a SAS or SATA drive, complete the following steps:
a. Install the arcconf utility.
b. Type arcconf getsmartstats 1 and press Enter.
c. Verify that the SMART health assessment passed.
Network adapter | Complete the following steps:
a. At the command line, type ethtool ethx, where x is the number of the physical port that you are testing. Verify that the connection speed that is indicated in the output is correct.
b. Perform a ping test to verify the network connectivity.
Collecting diagnostic data
Learn how to collect diagnostic data to send to IBM service and support.
About this task
To collect diagnostic data, complete the following steps:
Procedure
1. Is the operating system available?
If | Then
Yes: | Continue with step “2” on page 61.
No: | Continue with step “3” on page 61.
2. To collect diagnostic data from the operating system, complete the following steps:
a) Log in as root user.
b) At the command prompt, type sosreport and press Enter.
c) You are prompted for additional information. When the command is complete, the location of the
output file is displayed. Note the location of the output file. Then, continue with the next step.
3. To collect system event logs, complete the following steps:
a) Go to the IBM Support Portal (http://www.ibm.com/support/entry/portal/support).
b) In the search field, enter your machine type and model. Then, click the correct product support
entry for your system.
c) From the Downloads list, click the Scale-out LC System Event Log Collection Tool for your
machine type and model.
d) Follow the instructions to install and run the system event log collection tool. Then, continue with
the next step.
4. Send the data that you collected during this procedure to IBM service and support. This ends the
procedure.
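The branching in this procedure can be sketched as a small planning function. The function name and the wording of the returned steps are assumptions made for illustration; the sketch only lists what to do and does not run sosreport or the event log collection tool.

```python
from typing import List


def diagnostic_collection_steps(os_available: bool) -> List[str]:
    """List the collection steps to perform, in order.

    Hypothetical helper: step 2 (sosreport) applies only when the
    operating system is available; steps 3 and 4 always apply.
    """
    steps = []
    if os_available:
        # Step 2: collect operating system data as the root user.
        steps.append("Log in as root, run sosreport, and note the output file location.")
    # Step 3: collect system event logs with the tool from the IBM Support Portal.
    steps.append("Install and run the Scale-out LC System Event Log Collection Tool.")
    # Step 4: send everything that was collected.
    steps.append("Send the collected data to IBM service and support.")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(diagnostic_collection_steps(True), start=1):
        print(number, step)
```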
Contacting IBM service and support
You can contact IBM service and support by telephone or through the IBM Support Portal.
Before you contact IBM service and support, go to “Beginning troubleshooting and problem analysis” on
page 1 and complete all of the service actions indicated. If the service actions do not resolve the problem,
or if you are directed to contact support, go to “Collecting diagnostic data” on page 60. Then, use the
information below to contact IBM service and support.
Customers in the United States, United States territories, or Canada can place a hardware service request
online. To place a hardware service request online, go to the IBM Support Portal
(http://www.ibm.com/support/entry/portal/product/power/scale-out_lc).
For up-to-date telephone contact information, go to the Directory of worldwide contacts website
(www.ibm.com/planetwide/).
Table 18. Service and support contacts
Type of problem:
• Advice
• Migrating
• "How to"
• Operating
• Configuring
• Ordering
• Performance
• General information
Call:
• 1-800-IBM-CALL (1–800–426–2255)
• 1-800-IBM-4YOU (1–800–426–4968)
Table 18. Service and support contacts (continued)
Type of problem:
Software:
• Fix information
• Operating system problem
• IBM application program
• Loop, hang, or message
Hardware:
• IBM system hardware broken
• Hardware reference code
• IBM input/output (I/O) problem
• Upgrade
Call:
1-800-IBM-SERV (1–800–426–7378)
Finding parts and locations
Locate physical part locations and identify parts with system diagrams.
Locate the FRU
Use the graphics and tables to locate the field-replaceable unit (FRU) and identify the FRU part number.
5104-22C or 9006-22C locations
Use this information to find the location of a FRU in the system unit.
Rack views
The following diagrams show field-replaceable unit (FRU) layouts in the system. Use these diagrams with
the following tables.
Figure 1. Front view
Table 19. Front view locations
Index number | FRU description | FRU removal and replacement procedures
7 | HDD 6 | See Removing and replacing a disk drive in the 5104-22C or 9006-22C.
8 | HDD 7 | See Removing and replacing a disk drive in the 5104-22C or 9006-22C.
9 | HDD 8 | See Removing and replacing a disk drive in the 5104-22C or 9006-22C.
10 | HDD 9 | See Removing and replacing a disk drive in the 5104-22C or 9006-22C.
11 | HDD 10 | See Removing and replacing a disk drive in the 5104-22C or 9006-22C.
12 | HDD 11 | See Removing and replacing a disk drive in the 5104-22C or 9006-22C.
Figure 2. Top view
Table 20. Top view locations
Index number | FRU description | FRU removal and replacement procedures
13 | Disk drive backplane | See Removing and replacing the disk drive backplane in the 5104-22C or 9006-22C.
14 | Fan 2 | See Removing and replacing fans in the 5104-22C or 9006-22C.
15 | Fan 3 | See Removing and replacing fans in the 5104-22C or 9006-22C.
16 | Fan 6 | See Removing and replacing fans in the 5104-22C or 9006-22C.
17 | Fan 7 | See Removing and replacing fans in the 5104-22C or 9006-22C.
18 | CPU 1 | See Removing and replacing a system processor module for the 5104-22C or 9006-22C.
19 | CPU 2 | See Removing and replacing a system processor module for the 5104-22C or 9006-22C.
20 | Time-of-day battery | See Removing and replacing the time-of-day battery in the 5104-22C or 9006-22C.
21 | System backplane | See Removing and replacing the system backplane in the 5104-22C or 9006-22C.
22 | PSU 1 | See Removing and replacing a power supply in the 5104-22C or 9006-22C.
23 | PSU 2 | See Removing and replacing a power supply in the 5104-22C or 9006-22C.
Figure 3. Rear view
Table 21. Rear view locations
Index number | FRU description | FRU removal and replacement procedures
22 | PSU 1 | See Removing and replacing a power supply in the 5104-22C or 9006-22C.
23 | PSU 2 | See Removing and replacing a power supply in the 5104-22C or 9006-22C.
25 | PCIe adapter 0 (UIO Slot2) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
26 | PCIe adapter 1 (UIO Slot3) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
27 | PCIe adapter 2 (UIO Slot1) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
28 | PCIe adapter 3 (WIO Slot3) | See Removing and replacing PCIe adapters in the 9006-22C.
29 | PCIe adapter 4 (WIO-R Slot) | See Removing and replacing PCIe adapters in the 9006-22C.
30 | PCIe adapter 5 (WIO Slot2) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
31 | PCIe adapter 6 (WIO Slot4) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
32 | PCIe adapter 7 (WIO Slot1) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
33 | PCIe riser and network adapter (UIO Network) | See Removing and replacing PCIe adapters in the 5104-22C or 9006-22C.
Memory locations
The following diagram shows memory DIMMs and their corresponding field-replaceable unit (FRU)
layouts in the system. Use this diagram with the following table.
Figure 4. Memory locations
The following table provides the memory locations.
Table 22. Memory locations
Index number | FRU description | FRU removal and replacement procedures
34 | P1-DIMMA1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
35 | P1-DIMMA2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
36 | P1-DIMMB1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
37 | P1-DIMMB2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
38 | P1-DIMMC1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
39 | P1-DIMMC2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
Table 22. Memory locations (continued)
Index number | FRU description | FRU removal and replacement procedures
40 | P1-DIMMD1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
41 | P1-DIMMD2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
42 | P2-DIMMA1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
43 | P2-DIMMA2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
44 | P2-DIMMB1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
45 | P2-DIMMB2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
46 | P2-DIMMC1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
47 | P2-DIMMC2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
48 | P2-DIMMD1 | See Removing and replacing memory in the 5104-22C or 9006-22C.
49 | P2-DIMMD2 | See Removing and replacing memory in the 5104-22C or 9006-22C.
5104-22C or 9006-22C parts
Use this information to find the field-replaceable unit (FRU) part number.
After you identify the part number of the part that you want to order, go to Advanced Part Exchange
Warranty Service. Registration is required. If you are not able to identify the part number, go to Contacting
IBM service and support.
Rack final assembly
Figure 5. Rack final assembly
Table 23. Rack final assembly part numbers
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
1 | 01EM628 (MCP-290-00057-0N) | 1 | Slide rail kit - contains left and right slide rails and attaching screws
2 | 01EM628 (MCP-290-00057-0N) | 1 | Slide rail kit - contains left and right slide rails and attaching screws
System parts
Figure 6. System parts
Table 24. System parts
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
1 |  | 1 | Top cover assembly
  |  | 2 | Screws
2 |  | 1 | CPU air baffle
3 | 01EM725 (SNK-P0053P-IB001) | 2 | Heat sink kit (includes heat sink and thermal interface
4 | 01EM276 (PP9-MP02AA780-KIB001) | 1-2 | 16 core 2.9 GHz DD2.01 system processor module kit (includes system processor module tray and removal tool). Note: System processor modules with version DD2.01 are not compatible with system processor modules with version DD2.1 or DD2.11.
  | 02CL503 (PP9MP02AA880-KIB001) | 1-2 | 16 core 2.9 GHz DD2.1 system processor module kit (includes system processor module tray and removal tool). Note: System processor modules with version DD2.1 or DD2.11 are not compatible with system processor modules with version DD2.01.
  | 01EM360 (PP9MP02AA798-KIB001) |  | (includes system processor module tray and removal tool) (9006-22C). Note: System processor modules with version DD2.01 are not compatible with system processor modules with version DD2.1 or DD2.11.
  | 02CL674 (PP9MP02AA986-KIB001) | 1-2 | 16 core 2.9 GHz DD2.11 system processor module kit (includes system processor module tray and removal tool). Note: System processor modules with version DD2.1 or DD2.11 are not compatible with system processor modules with version DD2.01.
5 | 01EM644 (MEM-DR480L-SL02ER26) | 16 | 8 GB, 2666 MHz 1RX4 DDR4 RDIMM* (9006-22C)
  | 01EM738 (MEMDR416L-SL02ER26) | 16 | 16 GB, 2666 MHz 1RX4 DDR4 RDIMM* (5104-22C)
  | 01EM739 (MEMDR432L-SL02ER26) | 16 | 32 GB, 2666 MHz 2RX4 DDR4 RDIMM*
6 | 01EM416 (MBD-P9DSU-K0-IB001B) | 1 | System backplane kit
7 |  | 10 | Screws
8 | 01EM604 (FAN-0166L4) | 4 | Fan
Table 24. System parts (continued)
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
9 | 01EM614 (BPN-SAS3-826A-N4) | 1 | Disk drive backplane (supports 12 SAS or SATA drives)
10 |  | 7 | Screws
11 | 01EM652 (HDD-A2000ST2000NM0135) | 12 | 2 TB 3.5 inch SAS disk drive
  | 01EM654 (HDDA8000ST8000NM0075) |  | 8 TB 3.5 inch SAS disk drive (9006-22C)
12 | 01EM619 (PWS-1K22A-1R) | 2 | 1.2KW power supply
13 | 01EM722 (RSC-W2R-A8P) | 1 | PCIe riser for PCIe adapter 4 (WIO-R Slot)
14 |  | 1 | PCI adapter. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 5104-22C or 9006-22C.
15 |  | 1 | PCI adapter. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 5104-22C or 9006-22C.
16 | 01EM721 (AOC-2UR688i4XTF-IB001) | 1 | 2U UIO NIC PCIe adapter with integrated 4-port 10 GbE Base-T, Intel XL710, and CAPI. Note: This PCIe adapter is also a PCIe riser.
17 |  | 1 | PCIe cage
18 |  | 4 | PCI adapters. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 5104-22C or 9006-22C.
19 |  | 1 | PCIe riser
20 | 01EM723 (RSC-W2-6688P) |  |
21 |  | 1 | PCI adapter. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 5104-22C or 9006-22C.
*All of the memory in a 5104-22C or 9006-22C system must be the same size. The 5104-22C and
9006-22C systems do not support mixing different sizes of memory.
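The two compatibility rules stated in Table 24 (processor module versions and memory sizes) can be checked with a short sketch before parts are ordered. The function name, parameters, and messages are hypothetical. The sketch also assumes that DD2.1 and DD2.11 modules can be combined with each other, which is an inference from the notes and is not stated in the table.

```python
from typing import List

# Processor module versions named in the Table 24 notes.
OLD_LEVEL = {"DD2.01"}
NEW_LEVEL = {"DD2.1", "DD2.11"}  # assumed to be mixable with each other


def check_configuration(cpu_versions: List[str], dimm_sizes_gb: List[int]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found.

    Hypothetical helper for a 5104-22C or 9006-22C parts plan.
    """
    problems = []
    versions = set(cpu_versions)
    unknown = versions - OLD_LEVEL - NEW_LEVEL
    if unknown:
        problems.append("Unknown processor module version: " + ", ".join(sorted(unknown)))
    # DD2.01 modules are not compatible with DD2.1 or DD2.11 modules.
    if versions & OLD_LEVEL and versions & NEW_LEVEL:
        problems.append("DD2.01 modules cannot be mixed with DD2.1 or DD2.11 modules.")
    # All of the memory in the system must be the same size.
    if len(set(dimm_sizes_gb)) > 1:
        problems.append("All memory must be the same size.")
    return problems


if __name__ == "__main__":
    print(check_configuration(["DD2.01", "DD2.11"], [16, 16, 32]))
```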
Index number | FRU description | FRU removal and replacement procedures
5 | Disk drive backplane | See Removing and replacing the disk drive backplane in the 9006-12P.
6 | Fan 1 | See Removing and replacing fans in the 9006-12P.
7 | Fan 2 | See Removing and replacing fans in the 9006-12P.
8 | Fan 3 | See Removing and replacing fans in the 9006-12P.
9 | Fan 4 | See Removing and replacing fans in the 9006-12P.
10 | Fan 5 | See Removing and replacing fans in the 9006-12P.
11 | Fan 6 | See Removing and replacing fans in the 9006-12P.
12 | Fan 7 | See Removing and replacing fans in the 9006-12P.
13 | Fan 8 | See Removing and replacing fans in the 9006-12P.
Table 27. Top view locations (continued)
Index number | FRU description | FRU removal and replacement procedures
14 | CPU 1 | See Removing and replacing a system processor module for the 9006-12P.
15 | CPU 2 | See Removing and replacing a system processor module for the 9006-12P.
16 | Time-of-day battery | See Removing and replacing the time-of-day battery in the 9006-12P.
17 | System backplane | See Removing and replacing the system backplane in the 9006-12P.
18 | PSU 1 | See Removing and replacing a power supply in the 9006-12P.
19 | PSU 2 | See Removing and replacing a power supply in the 9006-12P.
20 | Trusted platform module (TPM) card | See Removing and replacing the TPM card in the 9006-12P.
Figure 9. Rear view
Table 28. Rear view locations
Index number | FRU description | FRU removal and replacement procedures
18 | PSU 1 | See Removing and replacing a power supply in the 9006-12P.
19 | PSU 2 | See Removing and replacing a power supply in the 9006-12P.
21 | Network adapter and PCIe riser (UIO Network) | See Removing and replacing PCIe adapters in the 9006-12P.
22 | PCIe adapter 1 (UIO Slot1). Note: This adapter does not have any external connectors. | See Removing and replacing PCIe adapters in the 9006-12P.
23 | PCIe adapter 2 (WIO Slot1) | See Removing and replacing PCIe adapters in the 9006-12P.
24 | PCIe adapter 3 (WIO Slot2) | See Removing and replacing PCIe adapters in the 9006-12P.
Table 28. Rear view locations (continued)
Index number | FRU description | FRU removal and replacement procedures
25 | PCIe adapter 4 (WIO-R Slot) | See Removing and replacing PCIe adapters in the 9006-12P.
Memory locations
The following diagram shows memory DIMMs and their corresponding field-replaceable unit (FRU)
layouts in the system. Use this diagram with the following table.
Figure 10. Memory locations
The following table provides the memory locations.
Table 29. Memory locations
Index number | FRU description | FRU removal and replacement procedures
26 | P1-DIMMA1 | See Removing and replacing memory in the 9006-12P.
27 | P1-DIMMA2 | See Removing and replacing memory in the 9006-12P.
28 | P1-DIMMB1 | See Removing and replacing memory in the 9006-12P.
29 | P1-DIMMB2 | See Removing and replacing memory in the 9006-12P.
Table 29. Memory locations (continued)
Index number | FRU description | FRU removal and replacement procedures
30 | P1-DIMMC1 | See Removing and replacing memory in the 9006-12P.
31 | P1-DIMMC2 | See Removing and replacing memory in the 9006-12P.
32 | P1-DIMMD1 | See Removing and replacing memory in the 9006-12P.
33 | P1-DIMMD2 | See Removing and replacing memory in the 9006-12P.
34 | P2-DIMMA1 | See Removing and replacing memory in the 9006-12P.
35 | P2-DIMMA2 | See Removing and replacing memory in the 9006-12P.
36 | P2-DIMMB1 | See Removing and replacing memory in the 9006-12P.
37 | P2-DIMMB2 | See Removing and replacing memory in the 9006-12P.
38 | P2-DIMMC1 | See Removing and replacing memory in the 9006-12P.
39 | P2-DIMMC2 | See Removing and replacing memory in the 9006-12P.
40 | P2-DIMMD1 | See Removing and replacing memory in the 9006-12P.
41 | P2-DIMMD2 | See Removing and replacing memory in the 9006-12P.
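Table 29 assigns index numbers 26 through 41 to the DIMM locations in a regular order, so a location label can be derived from an index number. The following sketch illustrates that mapping for the 9006-12P only. The function and constant names are hypothetical, and the sketch describes the table layout rather than any firmware interface.

```python
FIRST_INDEX = 26          # index number of P1-DIMMA1 in Table 29
LETTERS = "ABCD"          # letter part of the label, in table order
POSITIONS_PER_LETTER = 2  # labels end in 1 or 2
LOCATIONS_PER_PROCESSOR = len(LETTERS) * POSITIONS_PER_LETTER  # 8
PROCESSORS = 2            # labels begin with P1 or P2


def dimm_label(index_number: int) -> str:
    """Return the Table 29 FRU description for a 9006-12P index number."""
    offset = index_number - FIRST_INDEX
    if not 0 <= offset < PROCESSORS * LOCATIONS_PER_PROCESSOR:
        raise ValueError("index number is not a memory location in Table 29")
    processor = offset // LOCATIONS_PER_PROCESSOR + 1
    within = offset % LOCATIONS_PER_PROCESSOR
    letter = LETTERS[within // POSITIONS_PER_LETTER]
    position = within % POSITIONS_PER_LETTER + 1
    return f"P{processor}-DIMM{letter}{position}"


if __name__ == "__main__":
    for index in range(FIRST_INDEX, FIRST_INDEX + 16):
        print(index, dimm_label(index))
```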
Drive on module (DOM) locations
The following diagram shows drive on modules (DOMs) and their corresponding field-replaceable unit
(FRU) layouts in the system. Use this diagram with the following table.
Figure 11. Drive on module (DOM) locations
The following table provides the drive on module (DOM) locations.
Table 30. Drive on module (DOM) locations
Index number | FRU description | FRU removal and replacement procedures
42 | Drive on module (DOM) 1 | See Removing and replacing a storage drive in the 9006-12P.
43 | Drive on module (DOM) 2 | See Removing and replacing a storage drive in the 9006-12P.
9006-12P parts
Use this information to find the field-replaceable unit (FRU) part number.
After you identify the part number of the part that you want to order, go to Advanced Part Exchange
Warranty Service. Registration is required. If you are not able to identify the part number, go to Contacting
IBM service and support.
Rack final assembly
Figure 12. Rack final assembly
Table 31. Rack final assembly part numbers
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
1 | 02CM137 (MCP-290-00052-0N) | 1 | Slide rail kit - contains left and right slide rails and attaching screws
2 | 02CM137 (MCP-290-00052-0N) | 1 | Slide rail kit - contains left and right slide rails and attaching screws
System parts
Figure 13. System parts
Table 32. System parts
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
1 |  | 1 | Top cover assembly
  |  | 2 | Screws
2 |  | 2 | PCIe adapter. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 9006-12P.
3 | 01EM611 (RSC-W-66P-IB001) | 1 | PCIe riser for PCIe adapters. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 9006-12P.
4 |  | 1 | PCIe cage
5 | 02CM139 (RSC-R1UW-E8R-IB001) | 1 | PCIe riser
6 |  | 1 | PCIe adapter. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 9006-12P.
Table 32. System parts (continued)
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
7 |  | 1 | PCIe adapter. Use the feature type of the adapter to find the FRU number in PCIe adapter information by feature type for the 9006-12P.
8 | 01EM608 (AOC-UR-i4XTF-IB001) | 1 | 1U UIO NIC PCIe adapter with integrated 4-port 10 GbE Base-T, Intel XL710, and CAPI. Note: This PCIe adapter is also a PCIe riser.
9 | 02CM144 (PWS-1K02A-1R) | 2 | Power supply
Table 32. System parts (continued)
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
10 | 01EM682 (HDD-KIT-2A-ST1200IB001) | 4 | 1.2 TB 10k 2.5 inch SAS disk drive
  | 01EM683 (HDDKIT-2A-ST1800IB001) | 4 | 1.8 TB 10k 2.5 inch SAS disk drive
  | 01EM652 (HDDA2000ST2000NM0135) | 4 | 2.0 TB 7.2K (512 block size) 3.5 inch SAS disk drive
  | 01EM653 (HDDA4000ST4000NM0125) | 4 | 4.0 TB 7.2K (512 block size) 3.5 inch SAS disk drive
  | 01EM654 (HDDA8000ST8000NM0075) | 4 | 8.0 TB 7.2K (512 block size) 3.5 inch SAS disk drive
  | 01EM655 (HDDA10TST10000NM0096) | 4 | 10.0 TB 7.2K (512 block size) 3.5 inch SAS disk drive
  | 01EM656 (HDDA4000ST4000NM0075) | 4 | 4.0 TB 7.2K (4k block size) 3.5 inch self-encrypting SAS disk drive
  | 01EM657 (HDDA8000ST8000NM0095) | 4 | 8.0 TB 7.2K (4k block size) 3.5 inch self-encrypting SAS disk drive
  | 02CM136 (HDDT2000ST2000NM0125) | 4 | 2.0 TB 7.2K (512 block size) 3.5 inch SATA disk drive
  | 01EM659 (HDDT4000ST4000NM0115) | 4 | 4.0 TB 7.2K (512 block size) 3.5 inch SATA disk drive
  | 01EM660 (HDDT8000ST8000NM0055) | 4 | 8.0 TB 7.2K (512 block size) 3.5 inch SATA disk drive
  | 01EM661 (HDDA10TST10000NM0096) | 4 | 10.0 TB 7.2K (512 block size) 3.5 inch SATA disk drive
Table 32. System parts (continued)
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
10 | 01EM671 (HDS-KIT-2A-1920IB001) | 4 | 1.92 TB 2.5 inch SAS solid-state drive (1 drive write per day)
  | 01EM672 (HDSKIT-2A-3840IB001) | 4 | 3.84 TB 2.5 inch SAS solid-state drive (1 drive write per day)
  | 01EM673 (HDSKIT-2A3D-960IB001) | 4 | 960 GB 2.5 inch SAS solid-state drive (3 drive writes per day)
  | 01EM674 (HDSKIT-2A3D-1920IB001) | 4 | 1.92 TB 2.5 inch SAS solid-state drive (3 drive writes per day)
  | 01EM675 (HDSKIT-2A-1920SIB001) | 4 | 1.92 TB 2.5 inch self-encrypting SAS solid-state drive (1 drive write per day)
  | 01EM676 (HDSKIT-2A-3840SIB001) | 4 | 3.84 TB 2.5 inch self-encrypting SAS solid-state drive (1 drive write per day)
  | 01EM664 (HDSKIT-2T-240IB001) | 4 | 240 GB 2.5 inch self-encrypting SATA solid-state drive (0.78 drive writes per day)
  | 01EM665 (HDSKIT-2T-960IB001) | 4 | 960 GB 2.5 inch SATA solid-state drive (0.6 drive writes per day)
  | 01EM667 (HDSKIT-2T-3800IB001) | 4 | 3.84 TB 2.5 inch self-encrypting SATA solid-state drive (0.78 drive writes per day)
  | 01EM666 (HDSKIT-2T-1900IB001) | 4 | 1.92 TB 2.5 inch self-encrypting SATA solid-state drive (0.78 drive writes per day)
  | 01EM684 (HDSKIT-2T-480IB001) | 4 | 480 GB 2.5 inch self-encrypting SATA solid-state drive (3.5 drive writes per day)
Table 32. System parts (continued)
Index number | IBM part number (Supermicro part number) | Units per assembly | Description
10 | 01EM685 (HDS-KIT-2T-960SIB001) | 4 | 960 GB 2.5 inch self-encrypting SATA solid-state drive (3.5 drive writes per day)
  | 01EM686 (HDSKIT-2T-1920IB001) | 4 | 1.92 TB 2.5 inch self-encrypting SATA solid-state drive (3.5 drive writes per day)
  | 01EM679 (HDSKIT-08N-960IB001) | 4 | 960 GB 2.5 inch NVMe solid-state drive (0.8 drive writes per day)
  | 01EM680 (HDSKIT-08N-1920IB001) | 4 | 1.92 TB 2.5 inch NVMe solid-state drive (0.8 drive writes per day)
  | 01EM681 (HDSKIT-08N-3840IB001) | 4 | 3.84 TB 2.5 inch NVMe solid-state drive (0.8 drive writes per day)
  | 01EM668 (HDSKIT-5N-800IB001) | 4 | 800 GB 2.5 inch NVMe solid-state drive (5 drive writes per day)
  | 01EM669 (HDSKIT-5N-1600IB001) | 4 | 1.6 TB 2.5 inch NVMe solid-state drive (5 drive writes per day)
  | 01EM670 (HDSKIT-5N-3200IB001) | 4 | 3.2 TB 2.5 inch NVMe solid-state drive (5 drive writes per day)
11 | 02CM140 (BPN-SAS3-815TQ-N4) | 1 | Disk drive backplane
12 |  | 2 | Screws
13 | 02CM138 (FAN-0141L4) | 8 | Fan
14 |  | 2 | Fan holder
15 | 01EM607 (MCP-310-819150B-OEM) | 2 | CPU air baffle