IBM Power System, Power System 9006-22P, Power System 5104-22C, Power System 9006-12P, Power System 9006-22C Problem Analysis, System Parts, And Locations
Problem analysis, system parts, and
locations for the 5104-22C, 9006-12P,
9006-22C, and 9006-22P
IBM
Note
Before using this information and the product it supports, read the information in “Safety notices” on
page v, “Notices” on page 109, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823.
This edition applies to IBM® Power Systems servers that contain the POWER9™ processor and to all associated models.
Safety notices
Safety notices may be printed throughout this guide:
• DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
• CAUTION notices call attention to a situation that is potentially hazardous to people because of some
existing condition.
• Attention notices call attention to the possibility of damage to a program, device, system, or data.
World Trade safety information
Several countries require the safety information contained in product publications to be presented in their
national languages. If this requirement applies to your country, safety information documentation is
included in the publications package (such as in printed documentation, on DVD, or as part of the product)
shipped with the product. The documentation contains the safety information in your national language
with references to the U.S. English source. Before using a U.S. English publication to install, operate, or
service this product, you must first become familiar with the related safety information documentation.
You should also refer to the safety information documentation any time you do not clearly understand any
safety information in the U.S. English publications.
Replacement or additional copies of safety information documentation can be obtained by calling the IBM
Hotline at 1-800-300-8751.
German safety information
Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der
Bildschirmarbeitsverordnung geeignet.
Laser safety information
IBM servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.
Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.
DANGER: When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous.
To avoid a shock hazard:
• If IBM supplied the power cord(s), connect power to this unit only with the IBM provided power
cord. Do not use the IBM provided power cord for any other product.
• Do not open or service any power supply assembly.
• Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration
of this product during an electrical storm.
• The product might be equipped with multiple power cords. To remove all hazardous voltages,
disconnect all power cords.
– For AC power, disconnect all power cords from their AC power source.
– For racks with a DC power distribution panel (PDP), disconnect the customer’s DC power
source to the PDP.
• When connecting power to the product ensure all power cables are properly connected.
– For racks with AC power, connect all power cords to a properly wired and grounded electrical
outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the
system rating plate.
– For racks with a DC power distribution panel (PDP), connect the customer’s DC power source
to the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power
return wiring.
• Connect any equipment that will be attached to this product to properly wired outlets.
• When possible, use one hand only to connect or disconnect signal cables.
• Never turn on any equipment when there is evidence of fire, water, or structural damage.
• Do not attempt to switch on power to the machine until all possible unsafe conditions are
corrected.
• Assume that an electrical safety hazard is present. Perform all continuity, grounding, and power
checks specified during the subsystem installation procedures to ensure that the machine meets
safety requirements.
• Do not continue with the inspection if any unsafe conditions are present.
• Before you open the device covers, unless instructed otherwise in the installation and
configuration procedures: Disconnect the attached AC power cords, turn off the applicable
circuit breakers located in the rack power distribution panel (PDP), and disconnect any
telecommunications systems, networks, and modems.
DANGER:
• Connect and disconnect cables as described in the following procedures when installing,
moving, or opening covers on this product or attached devices.
To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. For AC power, remove the power cords from the outlets.
3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the
PDP and remove the power from the Customer's DC power source.
4. Remove the signal cables from the connectors.
5. Remove all cables from the devices.
To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. For AC power, attach the power cords to the outlets.
5. For racks with a DC power distribution panel (PDP), restore the power from the Customer's
DC power source and turn on the circuit breakers located in the PDP.
6. Turn on the devices.
Sharp edges, corners and joints may be present in and around the system. Use care when
handling equipment to avoid cuts, scrapes and pinching. (D005)
(R001 part 1 of 2):
DANGER: Observe the following precautions when working on or around your IT rack system:
• Heavy equipment–personal injury or equipment damage might result if mishandled.
• Always lower the leveling pads on the rack cabinet.
• Always install stabilizer brackets on the rack cabinet unless the earthquake option is to be
installed.
• To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest
devices in the bottom of the rack cabinet. Always install servers and optional devices starting
from the bottom of the rack cabinet.
• Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top
of rack-mounted devices. In addition, do not lean on rack mounted devices and do not use them
to stabilize your body position (for example, when working from a ladder).
• Stability hazard:
– The rack may tip over causing serious personal injury.
– Before extending the rack to the installation position, read the installation instructions.
– Do not put any load on the slide-rail mounted equipment mounted in the installation position.
– Do not leave the slide-rail mounted equipment in the installation position.
• Each rack cabinet might have more than one power cord.
– For AC powered racks, be sure to disconnect all power cords in the rack cabinet when directed
to disconnect power during servicing.
– For racks with a DC power distribution panel (PDP), turn off the circuit breaker that controls
the power to the system unit(s), or disconnect the customer’s DC power source, when
directed to disconnect power during servicing.
• Connect all devices installed in a rack cabinet to power devices installed in the same rack
cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device
installed in a different rack cabinet.
• An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts
of the system or the devices that attach to the system. It is the responsibility of the customer to
ensure that the outlet is correctly wired and grounded to prevent an electrical shock. (R001 part
1 of 2)
(R001 part 2 of 2):
CAUTION:
• Do not install a unit in a rack where the internal rack ambient temperatures will exceed the
manufacturer's recommended ambient temperature for all your rack-mounted devices.
• Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not
blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
• Consideration should be given to the connection of the equipment to the supply circuit so that
overloading of the circuits does not compromise the supply wiring or overcurrent protection. To
provide the correct power connection to a rack, refer to the rating labels located on the
equipment in the rack to determine the total power requirement of the supply circuit.
• (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer
brackets are not attached to the rack or if the rack is not bolted to the floor. Do not pull out more
than one drawer at a time. The rack might become unstable if you pull out more than one drawer
at a time.
• (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless
specified by the manufacturer. Attempting to move the drawer partially or completely out of the
rack might cause the rack to become unstable or cause the drawer to fall out of the rack. (R001
part 2 of 2)
CAUTION: Removing components from the upper positions in the rack cabinet improves rack
stability during relocation. Follow these general guidelines whenever you relocate a populated
rack cabinet within a room or building.
• Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack
cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you
received it. If this configuration is not known, you must observe the following precautions:
– Remove all devices in the 32U position (compliance ID RACK-001) or 22U (compliance ID
RR001) and above.
– Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
– Ensure that there are little-to-no empty U-levels between devices installed in the rack cabinet
below the 32U (compliance ID RACK-001) or 22U (compliance ID RR001) level, unless the
received configuration specifically allowed it.
• If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet
from the suite.
• If the rack cabinet you are relocating was supplied with removable outriggers they must be
reinstalled before the cabinet is relocated.
• Inspect the route that you plan to take to eliminate potential hazards.
• Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to
the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
• Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
• Ensure that all devices, shelves, drawers, doors, and cables are secure.
• Ensure that the four leveling pads are raised to their highest position.
• Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
• Do not use a ramp inclined at more than 10 degrees.
• When the rack cabinet is in the new location, complete the following steps:
– Lower the four leveling pads.
– Install stabilizer brackets on the rack cabinet or in an earthquake environment bolt the rack to
the floor.
– If you removed any devices from the rack cabinet, repopulate the rack cabinet from the
lowest position to the highest position.
• If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack
cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent.
Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the
pallet.
(R002)
DANGER: Hazardous voltage, current, or energy levels are present inside any component that has
this label attached. Do not open any cover or barrier that contains this label. (L001)
DANGER: Rack-mounted devices are not to be used as shelves or work spaces. Do not place
objects on top of rack-mounted devices. In addition, do not lean on rack-mounted devices and do
not use them to stabilize your body position (for example, when working from a ladder). Stability
hazard:
• The rack may tip over causing serious personal injury.
• Before extending the rack to the installation position, read the installation instructions.
• Do not put any load on the slide-rail mounted equipment mounted in the installation position.
• Do not leave the slide-rail mounted equipment in the installation position.
(L002)
DANGER: Multiple power cords. The product might be equipped with multiple AC power cords or
multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and
power cables. (L003)
(L007)
CAUTION: A hot surface nearby. (L007)
(L008)
CAUTION: Hazardous moving parts nearby. (L008)
All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1
laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser
product. Consult the label on each part for laser certification numbers and approval information.
CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following
information:
• Do not remove the covers. Removing the covers of the laser product could result in exposure to
hazardous laser radiation. There are no serviceable parts inside the device.
• Use of the controls or adjustments or performance of procedures other than those specified
herein might result in hazardous radiation exposure.
(C026)
CAUTION: Data processing environments can contain equipment transmitting on system links
with laser modules that operate at greater than Class 1 power levels. For this reason, never look
into the end of an optical fiber cable or open receptacle. Although shining light into one end and
looking into the other end of a disconnected optical fiber to verify the continuity of optic fibers may
not injure the eye, this procedure is potentially dangerous. Therefore, verifying the continuity of
optical fibers by shining light into one end and looking at the other end is not recommended. To
verify continuity of a fiber optic cable, use an optical light source and power meter. (C027)
CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments.
(C028)
CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the
following information:
• Laser radiation when open.
• Do not stare into the beam, do not view directly with optical instruments, and avoid direct
exposure to the beam. (C030)
CAUTION: The battery contains lithium. To avoid possible explosion, do not burn or charge the
battery.
Do Not:
• Throw or immerse into water
• Heat to more than 100 degrees C (212 degrees F)
• Repair or disassemble
Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local
regulations. In the United States, IBM has a process for the collection of this battery. For
information, call 1-800-426-4333. Have the IBM part number for the battery unit available when
you call. (C003)
CAUTION: Regarding IBM provided VENDOR LIFT TOOL:
• Operation of LIFT TOOL by authorized personnel only.
• LIFT TOOL intended for use to assist, lift, install, remove units (load) up into rack elevations. It is
not to be used loaded transporting over major ramps nor as a replacement for such designated
tools like pallet jacks, walkies, fork trucks and such related relocation practices. When this is not
practicable, specially trained persons or services must be used (for instance, riggers or movers).
• Read and completely understand the contents of LIFT TOOL operator's manual before using.
Failure to read, understand, obey safety rules, and follow instructions may result in property
damage and/or personal injury. If there are questions, contact the vendor's service and support.
Local paper manual must remain with machine in provided storage sleeve area. Latest revision
manual available on vendor's web site.
• Test verify stabilizer brake function before each use. Do not over-force moving or rolling the LIFT
TOOL with stabilizer brake engaged.
• Do not raise, lower or slide platform load shelf unless stabilizer (brake pedal jack) is fully
engaged. Keep stabilizer brake engaged when not in use or motion.
• Do not move LIFT TOOL while platform is raised, except for minor positioning.
• Do not exceed rated load capacity. See LOAD CAPACITY CHART regarding maximum loads at
center versus edge of extended platform.
• Only raise load if properly centered on platform. Do not place more than 200 lb (91 kg) on edge
of sliding platform shelf also considering the load's center of mass/gravity (CoG).
• Do not corner load the platforms, tilt riser, angled unit install wedge or other such accessory
options. Secure such platforms -- riser tilt, wedge, etc options to main lift shelf or forks in all four
(4x or all other provisioned mounting) locations with provided hardware only, prior to use. Load
objects are designed to slide on/off smooth platforms without appreciable force, so take care
not to push or lean. Keep riser tilt [adjustable angling platform] option flat at all times except for
final minor angle adjustment when needed.
• Do not stand under overhanging load.
• Do not use on uneven surface, incline or decline (major ramps).
• Do not stack loads.
• Do not operate while under the influence of drugs or alcohol.
• Do not support ladder against LIFT TOOL (unless the specific allowance is provided for one
following qualified procedures for working at elevations with this TOOL).
• Tipping hazard. Do not push or lean against load with raised platform.
• Do not use as a personnel lifting platform or step. No riders.
• Do not stand on any part of lift. Not a step.
• Do not climb on mast.
• Do not operate a damaged or malfunctioning LIFT TOOL machine.
• Crush and pinch point hazard below platform. Only lower load in areas clear of personnel and
obstructions. Keep hands and feet clear during operation.
• No Forks. Never lift or move bare LIFT TOOL MACHINE with pallet truck, jack or fork lift.
• Mast extends higher than platform. Be aware of ceiling height, cable trays, sprinklers, lights, and
other overhead objects.
• Do not leave LIFT TOOL machine unattended with an elevated load.
• Watch and keep hands, fingers, and clothing clear when equipment is in motion.
• Turn Winch with hand power only. If winch handle cannot be cranked easily with one hand, it is
probably over-loaded. Do not continue to turn winch past top or bottom of platform travel.
Excessive unwinding will detach handle and damage cable. Always hold handle when lowering,
unwinding. Always assure self that winch is holding load before releasing winch handle.
• A winch accident could cause serious injury. Not for moving humans. Make certain clicking sound
is heard as the equipment is being raised. Be sure winch is locked in position before releasing
handle. Read instruction page before operating this winch. Never allow winch to unwind freely.
Freewheeling will cause uneven cable wrapping around winch drum, damage cable, and may
cause serious injury.
• This TOOL must be maintained correctly for IBM Service personnel to use it. IBM shall inspect
condition and verify maintenance history before operation. Personnel reserve the right not to use
TOOL if inadequate. (C048)
Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS
(Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
• Network telecommunications facilities
• Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring
or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the
interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as
intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation
from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect
these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.
The dc-powered system is intended to be installed in a common bonding network (CBN) as described in
GR-1089-CORE.
Beginning troubleshooting and problem analysis
This information provides a starting point for analyzing problems.
This information is the starting point for diagnosing and repairing systems. From this point, you are guided
to the appropriate information to help you diagnose problems, determine the appropriate repair action,
and then complete the necessary steps to repair the system.
Note: Update the system firmware to the latest level before you start problem analysis. If you update the
system firmware, you will have the latest available fixes and improvements for error handling, reporting,
and isolation. For instructions about updating the system firmware, see Getting fixes.
What type of problem are you dealing with? | Problem analysis procedure
You do not know the type of problem. | Go to “Determining the problem analysis procedure to perform” on page 1.
A baseboard management controller (BMC) access problem occurred. | Go to “Resolving a BMC access problem” on page 2.
The system does not power on (the power button or the BMC power on command does not power on the system). | Go to “Resolving a power problem” on page 5.
A system firmware boot failure occurred (the system started but was not able to boot to the Petitboot menu). | Go to “Resolving a system firmware boot failure” on page 5.
A video graphics array (VGA) monitor problem occurred (the system started but no video is displayed on the monitor). | Go to “Resolving a VGA monitor problem” on page 7.
An operating system boot failure occurred (the system booted to the Petitboot menu but the operating system did not start). | Go to “Resolving an operating system boot failure” on page 7.
A sensor on the sensor readings GUI display is red. | Go to “Resolving a sensor indicator problem” on page 9.
A processor, memory, power, or cooling hardware failure occurred. | Go to “Resolving a hardware problem” on page 10.
Missing or faulty PCIe adapter or device. | Go to Resolving a PCIe adapter or device problem.
You have an FQPSPxxxxxxx event code. | Go to FQPSPxxxxxxx Event Codes.
Determining the problem analysis procedure to perform
Learn how to identify the correct problem analysis procedure to perform.
About this task
To determine the correct problem analysis procedure to perform, complete the following steps:
Procedure
1. After you apply power to the system, are the power supply LEDs green (either steady or flashing)?
If | Then
Yes | Continue with the next step.
No | Go to “Resolving a power problem” on page 5.
2. Can you access the baseboard management controller (BMC) across the network?
If | Then
Yes | Continue with the next step.
No | Go to “Resolving a BMC access problem” on page 2.
3. Can you boot the system to the Petitboot menu?
If | Then
Yes | Continue with the next step.
No | Go to “Resolving a system firmware boot failure” on page 5.
4. Is video displayed on the video graphics array (VGA) monitor?
If | Then
Yes | Continue with the next step.
No | Go to “Resolving a VGA monitor problem” on page 7.
5. Can you start the operating system?
If | Then
Yes | Continue with the next step.
No | Go to “Resolving an operating system boot failure” on page 7.
6. On the sensor readings GUI display, are any sensors red?
If | Then
Yes | Go to “Resolving a sensor indicator problem” on page 9.
No | Continue with the next step.
7. Go to “Resolving a hardware problem” on page 10. This ends the procedure.
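The procedure above is a fixed sequence of yes/no checks in which the first problem found decides where to go next. The following Python sketch illustrates that ordering only. The function name `select_procedure` and the keys of the `answers` dictionary are hypothetical names chosen for this example and are not part of any IBM tool. The procedure titles come from this section, and the target for step 1 is an assumption taken from the problem table at the start of the section.

```python
from typing import Dict

# Ordered checks from the procedure above. Each entry is:
# (hypothetical answer key, answer that indicates a problem, procedure to go to).
# The target for the first check is assumed from the problem table.
CHECKS = [
    ("power_supply_leds_green", False, "Resolving a power problem"),
    ("bmc_reachable", False, "Resolving a BMC access problem"),
    ("boots_to_petitboot", False, "Resolving a system firmware boot failure"),
    ("vga_video_displayed", False, "Resolving a VGA monitor problem"),
    ("operating_system_starts", False, "Resolving an operating system boot failure"),
    ("any_sensor_red", True, "Resolving a sensor indicator problem"),
]


def select_procedure(answers: Dict[str, bool]) -> str:
    """Return the title of the first procedure whose check indicates a problem."""
    for key, problem_answer, procedure in CHECKS:
        if answers[key] == problem_answer:
            return procedure
    # Step 7: no earlier check matched, so treat it as a hardware problem.
    return "Resolving a hardware problem"


if __name__ == "__main__":
    observed = {
        "power_supply_leds_green": True,
        "bmc_reachable": True,
        "boots_to_petitboot": False,
        "vga_video_displayed": True,
        "operating_system_starts": True,
        "any_sensor_red": False,
    }
    print(select_procedure(observed))
```

Because the checks are evaluated in order, a system that fails several checks is always routed to the earliest matching procedure, which mirrors how the numbered steps are meant to be followed.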
Resolving a BMC access problem
Learn how to identify the service action that is needed to resolve a baseboard management controller
(BMC) access problem.
Procedure
1. Ensure that the BMC password is not set to the default password. For information about changing the
default password, see Logging on to the BMC GUI. Does the problem persist?
If | Then
Yes | Continue with the next step.
No | This ends the procedure.
2. Are both ends of the network cable seated securely?
If | Then
Yes | Continue with the next step.
No | Seat both ends of the cable securely. If the problem persists, continue with the next step.
3. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC
power cords and power on the system. Does the BMC access problem persist?
If | Then
Yes | Continue with the next step.
No | This ends the procedure.
4. Verify that the BMC network settings are correct.
a) Power on the system by using the power button on the front of the system. Wait 1 - 2 minutes for
the system to display the Petitboot menu.
b) When the Petitboot menu is displayed, press any key to interrupt the boot process. Then, select
Exit to Shell.
c) Type the following command and press Enter:
ipmitool lan print 1
d) Verify that the MAC address and the IP address settings are correct. Then, continue with the next
step.
Note: If the IP address setting is incorrect, go to the Configuring the firmware IP address
website (http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwenablenetwork.htm).
If the MAC address is 00:00:00:00:00:00, go to “Contacting IBM
service and support” on page 61.
5. Are you able to log in to the BMC web interface?
If | Then
Yes | To update the BMC firmware, go to Updating the system firmware by using the BMC. If the problem persists, go to step “12” on page 4.
No | Continue with the next step.
6. Complete the following steps:
a. Connect a VGA monitor to the system.
b. Press the power button to power on the system.
c. Boot the system to the Petitboot menu. From the Petitboot menu, select Exit to shell.
7. Are you mounting the storage that contains the pUpdate utility and the BMC firmware file from a
network storage location?
If | Then
Yes | Continue with the next step.
No | Go to step “9” on page 4.
8. To update the BMC firmware by using a network storage location, complete the following steps:
a) Type mkdir /tmp/media and press Enter.
b) Type the following command and press Enter:
mount -t nfs xxx.xxx.xx.xx:/path/of/files /tmp/media, where xxx.xxx.xx.xx is the
IP address of the system to which you want to establish the connection.
c) Type cd /tmp/media and press Enter.
d) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
e) Allow at least 2 minutes for the BMC to reboot. Does the problem persist?
If | Then
Yes | Go to step “12” on page 4.
No | This ends the procedure.
9. Update the BMC firmware by using a USB device. Complete the following steps:
a) Ensure that the USB device is formatted by using the VFAT file system.
b) Insert the USB device into the system if you have not already done so.
c) Type mount and press Enter.
Is the following output displayed?
/dev/mapper/sdb1 mounted on /var/petitboot/mnt/dev/sdb1
If | Then
Yes | Continue with the next step.
No | Go to step “11” on page 4.
10. Complete the following steps:
a) Type cd /var/petitboot/mnt/dev/sdb1 and press Enter.
b) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
c) Allow at least 2 minutes for the BMC to reboot. Does the problem persist?
If | Then
Yes | Go to step “12” on page 4.
No | This ends the procedure.
11. Complete the following steps:
a) Type mkdir /tmp/media and press Enter.
b) Type mount /dev/mapper/sdb1 /tmp/media and press Enter.
c) Type cd /tmp/media and press Enter.
d) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
e) Allow at least 2 minutes for the BMC to reboot. Does the problem persist?
If | Then
Yes | Go to step “12” on page 4.
No | This ends the procedure.
12. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63
to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
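Step 4 of this procedure asks you to read the MAC address and IP address from the output of ipmitool lan print 1. The following Python sketch shows one way that check could be automated on captured output. It is illustrative only: the function names, the `SAMPLE` text, and the addresses in it are made up for this example, and the sketch assumes the output consists of "Label : value" lines with the labels "IP Address" and "MAC Address".

```python
from typing import Dict, List

# Value that the procedure treats as a reason to contact IBM service and support.
UNSET_MAC = "00:00:00:00:00:00"


def parse_lan_print(output: str) -> Dict[str, str]:
    """Parse 'Label : value' lines into a dictionary keyed by label."""
    settings: Dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only, so MAC addresses stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key:  # skip blank lines and lines with no label
            settings[key] = value.strip()
    return settings


def check_bmc_lan(output: str, expected_ip: str) -> List[str]:
    """Return a list of findings; an empty list means both settings look correct."""
    settings = parse_lan_print(output)
    findings: List[str] = []
    # A missing MAC address line is treated the same as an unset MAC address.
    if settings.get("MAC Address", UNSET_MAC) == UNSET_MAC:
        findings.append("MAC address is unset: contact IBM service and support")
    if settings.get("IP Address") != expected_ip:
        findings.append("IP address is incorrect: configure the firmware IP address")
    return findings


# Hypothetical captured output, shortened for illustration.
SAMPLE = """\
IP Address Source       : Static Address
IP Address              : 192.0.2.10
Subnet Mask             : 255.255.255.0
MAC Address             : 00:00:00:00:00:00
"""

if __name__ == "__main__":
    for finding in check_bmc_lan(SAMPLE, expected_ip="192.0.2.10"):
        print(finding)
```

The sketch only reports findings and leaves the service decision to the procedure: an unset MAC address leads to IBM service and support, and an incorrect IP address leads to the firmware IP configuration instructions.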
Resolving a power problem
Learn how to identify the service action that is needed to resolve a power problem.
Procedure
1. Is the identify LED on the front of the system flashing red slowly at 0.25 Hz? For more information
about LEDs, see LEDs on the 9006-12P system or LEDs on the 5104-22C, 9006-22C, or 9006-22P
system.
If | Then
Yes | Continue with the next step.
No | No service action is required. This ends the procedure.
2. Perform the following actions, one at a time, until the problem is resolved:
a. Ensure that all of the power cords are fully seated in the power supplies.
b. Ensure that the power supply is fully seated in the system.
c. Ensure that the power supply fan is not blocked.
d. Ensure that all of the power cords are fully seated in the power distribution units (PDUs) or wall
outlets.
e. If the power cords are plugged into PDUs, ensure that the PDUs are turned on.
f. Replace the power cords.
g. Replace the power supplies.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
Resolving a system firmware boot failure
Learn how to identify the service action that is needed to resolve a failure while booting your system
firmware.
Procedure
1. Does the baseboard management controller (BMC) respond to commands and are you able to access
the BMC web interface?
Note: To determine whether the BMC responds to commands, run the following ipmitool command:
ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> chassis status
• Yes: Continue with step “3” on page 6.
• No: Continue with the next step.
2. Complete the following actions, one at a time, until the problem is resolved:
a. Reset the BMC remotely by entering the following command:
ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> mc reset cold
b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5
minutes, and then go to step “1” on page 5.
c. Update the BMC firmware by using the pUpdate command with the block transfer (BT) option:
1) Type mkdir /tmp/media and press Enter.
2) Type the following command and press Enter:
mount -t nfs xxx.xxx.xx.xx:/path/of/files /tmp/media, where xxx.xxx.xx.xx is
the IP address of the system to which you want to establish the connection.
3) Type cd /tmp/media and press Enter.
4) To update the BMC firmware, type the following command and press Enter:
./pUpdate -f bmc.bin -i bt, where bmc.bin is the name of the BMC image file.
5) Allow at least 2 minutes for the BMC to reboot.
d. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
3. After you pressed the power button, did the system turn on but fail to display the Petitboot menu?
• Yes: Continue with the next step.
• No: This ends the procedure.
4. Complete the following actions, one at a time, until the problem is resolved:
a. Ensure that the TPM card is fully seated.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location.
b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5
minutes, and then go to step “3” on page 6.
c. Update the PNOR firmware. For instructions, see Getting fixes.
Note: If your system is a 9006-12P or 9006-22P, the PNOR firmware level must be
V2.12-20190404, or later.
d. Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
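The check in step 1 of this procedure can be wrapped in a small shell function. The following is a minimal sketch, not part of the IBM procedure: the function name bmc_responds and the variables BMC_USER, BMC_PASS, and BMC_HOST are hypothetical, and the sketch assumes that a zero exit status from ipmitool means that the BMC responded.

```shell
#!/bin/sh
# Sketch only: BMC_USER, BMC_PASS, and BMC_HOST are hypothetical variable
# names that hold the BMC credentials and the BMC IP address or hostname.

bmc_responds() {
    # Runs the same "chassis status" query that step 1 uses.
    # Assumption: ipmitool exits with status 0 only when the BMC answers.
    if ipmitool -I lanplus -U "$BMC_USER" -P "$BMC_PASS" -H "$BMC_HOST" \
        chassis status >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

# Example use (commented out so that sourcing this file runs nothing):
#   BMC_USER=admin BMC_PASS=secret BMC_HOST=bmc.example.com
#   if [ "$(bmc_responds)" = no ]; then
#       echo "BMC not responding; continue with step 2 of the procedure"
#   fi
```

A "no" result corresponds to the No branch of step 1. The function does not check the BMC web interface, which must still be verified separately.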
Resolving a VGA monitor problem
Learn how to identify the service action that is needed to resolve a video graphics array (VGA) monitor
problem.
Procedure
1. Is the system powered on and is the VGA monitor connected to the VGA display port, but no video is
displayed?
• Yes: Continue with the next step.
• No: This ends the procedure.
2. Complete the following steps, one at a time until the problem is resolved:
a) Ensure that the VGA cable is properly seated to the server port and to the monitor port.
b) Verify that your monitor and your VGA cable are working properly by testing them on a system that
is known to be working properly. If the monitor or the VGA cable does not work properly, replace it.
c) Verify that the system is powered on by activating a serial over LAN (SOL) session through the
baseboard management controller (BMC). If the system is not active, go to “Resolving a system
firmware boot failure” on page 5.
d) Replace the system backplane.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
Resolving an operating system boot failure
Learn how to identify the service action that is needed to resolve a failure while booting your operating
system.
Procedure
1. Was the system recently installed, serviced, moved, or upgraded?
• Yes: Ensure that all cables are properly seated in the connection path to the designated
boot device. This ends the procedure.
• No: Continue with the next step.
2. Are you booting the operating system from a network location?
• Yes: Continue with the next step.
• No: Continue with step “4” on page 8.
3. Complete the following actions, one at a time until the problem is resolved:
a. Ensure that a problem does not exist with the connection to the network location.
b. Ensure that the adapter has a valid IP address for the network.
c. Replace the network adapter.
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page
63 to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
4. Petitboot displays all recognized bootable images to use by default. Is the boot image recognized by
Petitboot?
• Yes: Continue with step “10” on page 9.
• No: Select the Petitboot menu option to refresh the boot images. If the problem persists,
continue with the next step.
5. To determine the command to type on the Petitboot command line to verify that the boot drive is
recognized and in optimal status, use Table 1 on page 8.
Table 1. Determine the command to verify that the boot drive is recognized and in optimal status
Boot drive configuration | Commands
Virtual drive connected directly to the system backplane | arcconf getconfig 1 LD
Physical drive connected directly to the system backplane | arcconf getconfig 1 PD
Is the boot drive recognized and in optimal status?
• Yes: Reinstall the operating system on the boot drive. This ends the procedure.
• No: Continue with the next step.
6. Are the drives properly seated in their respective drive bays?
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63
to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
Answer:
• Yes: Continue with the next step.
• No: Properly seat the drives in the drive bays. Then, go to step “4” on page 8.
7. Refresh the Petitboot boot options. Is the boot image on the boot drive recognized?
• Yes: Boot the operating system. Then, continue with step “10” on page 9.
• No: Continue with the next step.
8. To determine the command to type on the Petitboot command line to verify that the drives that are
known to be in a RAID array are recognized, use Table 2 on page 9.
Table 2. Determine the command to verify that the drives that are known to be in a RAID array are
recognized
Drive configuration | Commands
Drive connected directly to the system backplane | arcconf getconfig 1 LD
 | arcconf getconfig 1 PD
Are the drives that are known to be in the RAID array recognized?
• Yes: Reinstall the operating system on the boot drive. This ends the procedure.
• No: Continue with the next step.
9. Complete the following actions, one at a time, until the physical drives are recognized in the RAID
array:
Note:
• If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C locations” on page 63
to identify the physical location and the removal and replacement procedure.
• If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify the physical
location and the removal and replacement procedure.
• If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify the physical
location and the removal and replacement procedure.
a. If the drive is connected directly to the system backplane, ensure that the mini-SAS cable and
SATA cables are properly seated in the disk drive backplane and system backplane.
b. Replace the SAS or SATA cable.
c. If the drive is connected directly to the system backplane, replace the system backplane.
This ends the procedure.
10. Does an operating system error occur during the boot?
• Yes: Recover the operating system with the tools for the operating system. If that does not
resolve the problem, reinstall the operating system. This ends the procedure.
• No: Reinstall the operating system. This ends the procedure.
Resolving a sensor indicator problem
Learn how to resolve a sensor indicator problem.
About this task
To determine whether a service action is required, complete the following procedure:
Note: For more information about sensors, see Sensor readings GUI display.
Procedure
1. If the system is not powered on, boot the system to the operational state. Log in to the BMC web
interface. Then, click Server Health > Sensor Readings.
Are any of the sensor indicator LEDs red?
• Yes: Continue with the next step.
• No: This ends the procedure.
2. Record the names of any sensors that have a red LED indicator status.
Note: Repeat steps 3 - 6 for every sensor that you record in this step.
3. Use one of the following commands to list the system event logs (SELs).
• To list SELs by using an in-band network, enter the following command:
ipmitool sel elist
• To list SELs remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
4. Review the list of SELs and locate the log entry that meets the following criteria:
• The name of one of the sensors that you recorded in step 2 is present.
• A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 23.
• Asserted is in the description.
Did you identify a log entry that meets the above criteria?
• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service and
support” on page 61. This ends the procedure.
5. Use one of the following options to display the SEL details for the sensor:
Note: You must specify the SEL record ID in hexadecimal format. For example: 0x1a.
• To display SEL details by using an in-band network, enter the following command:
ipmitool sel get <SEL record ID>
• To display SEL details remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
6. The sensor ID field contains sensor information in the sensor name (sensor ID) format. Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
• If your system is a 5104-22C, 9006-12P, 9006-22C, or 9006-22P, go to “Identifying a service action
by using sensor and event information for the 5104-22C, 9006-12P, 9006-22C, or 9006-22P” on
page 24 to determine the service action to perform. This ends the procedure.
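Steps 3 through 5 can be partly scripted. The following is a minimal sketch under stated assumptions, not an IBM-supplied tool: the function name sel_ids_for_sensor and the sensor name "CPU1 Temp" are hypothetical, and the sketch assumes that each line of ipmitool sel elist output is a list of fields separated by the | character, with the record ID as the first field, printed in hexadecimal without the 0x prefix.

```shell
#!/bin/sh
# Sketch only. Reads "ipmitool sel elist" text on standard input and prints,
# for every entry that names the given sensor and is marked Asserted, the
# record ID in the 0x form that "ipmitool sel get" expects.
# Assumption: fields are separated by "|" and field 1 is the hex record ID.

sel_ids_for_sensor() {
    # $1 = sensor name recorded in step 2 (for example, a hypothetical "CPU1 Temp")
    awk -F'|' -v name="$1" '
        index($0, name) && index($0, "| Asserted") {
            id = $1
            gsub(/[ \t]/, "", id)   # strip the padding around the record ID
            print "0x" id
        }'
}

# Example use (commented out so that sourcing this file runs nothing):
#   ipmitool sel elist | sel_ids_for_sensor "CPU1 Temp"
#   ipmitool sel get 0x1a
```

The sketch covers only the sensor-name and Asserted criteria from step 4. Whether a service action keyword is present must still be checked against the list of service action keywords.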
Resolving a hardware problem
Learn how to identify the service action that is needed to resolve a hardware problem.
Procedure
1. If you have not already done so, manually boot the system.
2. Go to “Identifying a service action by using system event logs” on page 18. Then, continue with the
next step.
3. Was a service action identified?
• Yes: Continue with the next step.
• No: Go to step “5” on page 11.
4. Did the service action fix the problem?
• Yes: This ends the procedure.
• No: Go to step “5” on page 11.
5. Go to “Resolving a PCIe adapter or device problem” on page 11. Then, continue with the next step.
6. Was a service action identified?
• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service
and support” on page 61. This ends the procedure.
7. Did the service action fix the problem?
• Yes: This ends the procedure.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service
and support” on page 61. This ends the procedure.
Resolving a PCIe adapter or device problem
Learn how to access log files, information to identify types of events, and a list of potential problems and
service actions.
About this task
Procedure
1. To identify the correct service procedure to perform by using operating system log information,
complete the following steps:
a) Log in as the root user.
b) At the command prompt, type dmesg and press Enter.
2. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed.
When you find a keyword that accompanies one or more of the resource names in Table 3 on page
12, a service action is required.
Did you find an operating system log that requires a service action?
• Yes: Use Table 3 on page 12 to determine the service procedure to perform for your type
of problem. This ends the procedure.
• No: Continue with the next step.
Table 3. Resource names, examples, and service procedures for different types of operating system
logs.
Resource name | Example of a log requiring a service action | Type of problem | Service procedure
eth1, eth2, eth3, enPxxxxx (where xxxxx indicates the network port); mlx5_core; tg3 | Link Down; Link is Down; PCI I/O error detected.; Failed to reinitialize device; health_care: handling bad device here | Network | Go to “Resolving a network adapter problem” on page 13.
nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter | Go to “Resolving an NVMe Flash adapter problem” on page 15.
sda, sdb, sdc | FAILED Result | Storage | Go to “Resolving a storage device problem” on page 15.
EEH | Detected error on PHB#xxx, where xxx is the PHB number. | PCIe bus or adapter | Resolve any device driver errors that are related to I/O and that occurred near the time of this operating system log entry.
EEH | xxx has failed 6 times in the last hour and has been permanently disabled, where xxx is the PCI bus number. | PCIe bus or adapter | Ensure that the correct device drivers are properly installed for the device. If the problem persists, replace the adapter in the PCIe slot that is specified in the operating system log entry.
3. Are all of the adapters in the system missing or failed?
• Yes: Perform the following actions, one at a time, until the problem is resolved:
  a. Ensure that the PCIe risers are fully seated in the system.
  b. Replace system processor CPU 1.
  c. Replace the system backplane.
    • If your system is a 5104-22C or 9006-22C, go to “5104-22C or 9006-22C
    locations” on page 63 to identify the physical location and the removal and
    replacement procedure.
    • If your system is a 9006-12P, go to “9006-12P locations” on page 75 to identify
    the physical location and the removal and replacement procedure.
    • If your system is a 9006-22P, go to “9006-22P locations” on page 91 to identify
    the physical location and the removal and replacement procedure.
• No: Go to “Collecting diagnostic data” on page 60. Then, go to “Contacting IBM service
and support” on page 61.
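The log scan in steps 1 and 2 of this procedure can be approximated with a coarse filter. The following is a minimal sketch, not an IBM-supplied tool: the function name scan_os_log is hypothetical, the keyword list is limited to the fail, failure, and failed keywords that step 2 names, and the resource-name patterns are an approximation of the names in Table 3.

```shell
#!/bin/sh
# Sketch only. Reads operating system log text (for example, dmesg output)
# on standard input and prints the lines that contain a failure keyword
# together with one of the resource names from Table 3.
# The patterns are deliberately coarse and can produce false positives.

scan_os_log() {
    # First filter: the keywords named in step 2, in any letter case.
    grep -iE 'fail(ure|ed)?' |
        # Second filter: resource names (network, NVMe, storage, EEH).
        grep -E 'eth[0-9]+|enP[0-9A-Za-z]+|mlx5_core|tg3|nvme|(^|[^[:alnum:]])sd[a-z]+|EEH'
}

# Example use (commented out so that sourcing this file runs nothing):
#   dmesg | scan_os_log | head -n 1    # first matching occurrence
```

Table 3 also lists example messages that do not contain a failure keyword, such as Link is Down, so an empty result from this filter does not by itself rule out a service action.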
Resolving a network adapter problem
Learn about the possible problems and service actions that you can perform to resolve a network adapter
problem.
About this task
Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by
using the slot number” on page 16.
Table 4. Network adapter problems and service actions
Problem: System is unable to find the adapter or the negotiated PCIe bandwidth of the adapter is less
than expected
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system, or install the most recent
firmware if it is not already installed.
5. Restart the system.
6. Replace the adapter.
7. If the adapter is connected to a PCIe riser, replace the PCIe riser.
8. If the adapter is in UIO slot 1, UIO slot 2, or UIO slot 3, replace CPU 1. Otherwise, replace CPU 2.
9. Replace the system backplane.
Problem: Adapter suddenly stops working
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter is
seated properly and all associated cables are correctly connected.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently
added one or more new adapters, remove them and then test to determine whether the failing
adapter is functioning properly again. If the network adapter is functioning again, review the
IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then,
reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. If the adapter is connected to a PCIe riser, replace the PCIe riser.
7. If the adapter is in UIO slot 1, UIO slot 2, or UIO slot 3, replace CPU 1. Otherwise, replace CPU 2.
8. Replace the system backplane.
Problem: Link indicator light on the adapter is off
Service action:
1. Verify that the cable functions properly by testing it with a known working connection.
2. Verify that the port or ports on the switch are enabled and functional.
3. Verify that the switch and adapter are compatible.
4. Replace the adapter.
Problem: Link light on the adapter is on, but there is no communication from the adapter
Service action:
1. Verify that the most recent driver is installed, or install the most recent driver if it is not already
installed.
2. Verify that the adapter and its link have compatible settings, such as speed and duplex
configuration.
Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For information
about adapter user information, see User guides for PCIe adapters.
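One way to check whether the negotiated PCIe bandwidth of an adapter is less than expected is to compare the link capability with the link status that the operating system reports. The following is a minimal sketch and is not part of the IBM procedure: it assumes a Linux operating system with the lspci command from pciutils, it assumes that lspci -vv output contains LnkCap: and LnkSta: lines with "Speed" and "Width" values, and the function name pcie_link_state and the device address in the example are hypothetical.

```shell
#!/bin/sh
# Sketch only. Reads "lspci -vv" text for one device on standard input and
# prints "ok" when the negotiated speed and width (LnkSta) equal the
# advertised capability (LnkCap), "reduced" when they differ, and
# "unknown" when either line is missing.

pcie_link_state() {
    awk '
        # Return the first substring of line that matches re, or "".
        function field(line, re) {
            return match(line, re) ? substr(line, RSTART, RLENGTH) : ""
        }
        /LnkCap:/ {
            cap_s = field($0, "Speed [0-9.]+GT/s")
            cap_w = field($0, "Width x[0-9]+")
        }
        /LnkSta:/ {
            sta_s = field($0, "Speed [0-9.]+GT/s")
            sta_w = field($0, "Width x[0-9]+")
        }
        END {
            if (cap_s == "" || sta_s == "") { print "unknown"; exit }
            print (cap_s == sta_s && cap_w == sta_w) ? "ok" : "reduced"
        }'
}

# Example use (commented out; 0001:01:00.0 is a hypothetical device address):
#   lspci -vv -s 0001:01:00.0 | pcie_link_state
```

A "reduced" result only indicates that the link trained below the capability of the adapter. The service actions in Table 4 still determine what to do next.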
Resolving an NVMe Flash adapter problem
Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile
Memory Express (NVMe) Flash adapter problem.
About this task
Note: To determine the location of the NVMe Flash adapter, see “Identifying the location of the NVMe
Flash adapter” on page 17.
Table 5. NVMe Flash adapter problems and service actions
Problem: System is unable to find the NVMe Flash adapter
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the
NVMe Flash adapter is seated and installed properly.
2. Verify that the NVMe Flash adapter is compatible with the system.
3. Verify that the most recent firmware is installed on the system, or install the most recent
firmware if it is not already installed.
4. Replace the NVMe Flash adapter.
Problem: NVMe Flash adapter stops working suddenly
Service action:
1. Check the system logs to verify whether the system detected a problem.
2. Replace the NVMe Flash adapter.
Problem: Other problems
Service action: Check the messages and resolve any other problems that are detected. Then, test
the NVMe Flash adapter again.
Resolving a storage device problem
Learn about the possible problems and service actions that you can perform to resolve a storage device
problem.
About this task
Note: To determine the location of the storage device, see “Identifying the location of the storage device”
on page 17.
Table 6. Storage device problems and service actions
Problem: System is unable to find more than one storage device
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that each device is
seated and installed properly.
2. Verify that each device is compatible with your system.
3. Verify that all internal cables are properly seated and are not physically damaged.
4. Verify that the most recent firmware is installed on the system, or install the most recent
firmware if it is not already installed.
5. If the devices are part of a RAID configuration, ensure that each device has been enabled and is
part of an array.
6. Replace the cable that connects the disk drive backplane to the system backplane.
Problem: System is unable to find a storage device
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the device is
seated and installed properly.
2. Verify that the device is compatible with your system.
3. Verify that all internal cables are properly seated and are not physically damaged.
4. Verify that the most recent firmware is installed on the system, or install the most recent
firmware if it is not already installed.
5. If the device is part of a RAID configuration, ensure that the device has been enabled and is
part of an array.
6. Install the device in an open or free slot. If the device is then found, replace the
component with the failing connector.
7. Replace the storage device.
8. Replace any applicable attached cable.
Problem: More than one storage device suddenly stops working
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that each device is
seated and installed properly.
2. Check the system logs to verify whether the system detected a problem.
3. Replace the cable that connects the disk drive backplane to the system backplane.
Problem: One storage device suddenly stops working
Service action:
1. Verify that all internal cables are properly seated and are not physically damaged.
2. Check the system logs to verify whether the system detected a problem.
3. Replace the drive.
4. Replace the system backplane.
5. Replace the cable.
Problem: Other problems
Service action: Check the messages and resolve any other problems that were detected. Then, test the
drive again. If the drive still does not function, refer to the documentation for the drive.
Identifying the location of the PCIe adapter by using the slot number
The error message provides information to help you to determine the location of the PCIe adapter.
About this task
For example, the log might contain an error similar to the following text: