Problem analysis, system parts, and
locations for the IBM Power System
S822LC (8335-GCA, 8335-GTA, and
8335-GTB), and IBM Power System
S812LC (8348-21C)
IBM
Power Systems
Note
Before using this information and the product it supports, read the information in “Safety notices” on page v, “Notices” on
page 145, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823.
This edition applies to IBM Power Systems™ servers that contain the POWER8® processor and to all associated
models.
Safety notices
Safety notices may be printed throughout this guide:
v DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to
people.
v CAUTION notices call attention to a situation that is potentially hazardous to people because of some
existing condition.
v Attention notices call attention to the possibility of damage to a program, device, system, or data.
World Trade safety information
Several countries require the safety information contained in product publications to be presented in their
national languages. If this requirement applies to your country, safety information documentation is
included in the publications package (such as in printed documentation, on DVD, or as part of the
product) shipped with the product. The documentation contains the safety information in your national
language with references to the U.S. English source. Before using a U.S. English publication to install,
operate, or service this product, you must first become familiar with the related safety information
documentation. You should also refer to the safety information documentation any time you do not
clearly understand any safety information in the U.S. English publications.
Replacement or additional copies of safety information documentation can be obtained by calling the IBM
Hotline at 1-800-300-8751.
German safety information
Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der
Bildschirmarbeitsverordnung geeignet.
Laser safety information
IBM® servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.
Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.
DANGER: When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid
a shock hazard:
v If IBM supplied the power cord(s), connect power to this unit only with the IBM provided power cord.
Do not use the IBM provided power cord for any other product.
v Do not open or service any power supply assembly.
v Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this
product during an electrical storm.
v The product might be equipped with multiple power cords. To remove all hazardous voltages,
disconnect all power cords.
– For AC power, disconnect all power cords from their AC power source.
– For racks with a DC power distribution panel (PDP), disconnect the customer’s DC power source to
the PDP.
v When connecting power to the product ensure all power cables are properly connected.
– For racks with AC power, connect all power cords to a properly wired and grounded electrical
outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system
rating plate.
– For racks with a DC power distribution panel (PDP), connect the customer’s DC power source to
the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power
return wiring.
v Connect any equipment that will be attached to this product to properly wired outlets.
v When possible, use one hand only to connect or disconnect signal cables.
v Never turn on any equipment when there is evidence of fire, water, or structural damage.
v Do not attempt to switch on power to the machine until all possible unsafe conditions are corrected.
v Assume that an electrical safety hazard is present. Perform all continuity, grounding, and power checks
specified during the subsystem installation procedures to ensure that the machine meets safety
requirements.
v Do not continue with the inspection if any unsafe conditions are present.
v Before you open the device covers, unless instructed otherwise in the installation and configuration
procedures: Disconnect the attached AC power cords, turn off the applicable circuit breakers located in
the rack power distribution panel (PDP), and disconnect any telecommunications systems, networks,
and modems.
DANGER:
v Connect and disconnect cables as described in the following procedures when installing, moving, or
opening covers on this product or attached devices.
To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. For AC power, remove the power cords from the outlets.
3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the
PDP and remove the power from the Customer's DC power source.
4. Remove the signal cables from the connectors.
5. Remove all cables from the devices.
To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. For AC power, attach the power cords to the outlets.
5. For racks with a DC power distribution panel (PDP), restore the power from the Customer's DC
power source and turn on the circuit breakers located in the PDP.
6. Turn on the devices.
Sharp edges, corners and joints may be present in and around the system. Use care when handling
equipment to avoid cuts, scrapes and pinching. (D005)
(R001 part 1 of 2):
DANGER: Observe the following precautions when working on or around your IT rack system:
v Heavy equipment–personal injury or equipment damage might result if mishandled.
v Always lower the leveling pads on the rack cabinet.
v Always install stabilizer brackets on the rack cabinet.
v To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices
in the bottom of the rack cabinet. Always install servers and optional devices starting from the bottom
of the rack cabinet.
v Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of
rack-mounted devices. In addition, do not lean on rack mounted devices and do not use them to
stabilize your body position (for example, when working from a ladder).
v Each rack cabinet might have more than one power cord.
– For AC powered racks, be sure to disconnect all power cords in the rack cabinet when directed to
disconnect power during servicing.
– For racks with a DC power distribution panel (PDP), turn off the circuit breaker that controls the
power to the system unit(s), or disconnect the customer’s DC power source, when directed to
disconnect power during servicing.
v Connect all devices installed in a rack cabinet to power devices installed in the same rack cabinet. Do
not plug a power cord from a device installed in one rack cabinet into a power device installed in a
different rack cabinet.
v An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the
system or the devices that attach to the system. It is the responsibility of the customer to ensure that
the outlet is correctly wired and grounded to prevent an electrical shock.
(R001 part 2 of 2):
CAUTION:
v Do not install a unit in a rack where the internal rack ambient temperatures will exceed the
manufacturer's recommended ambient temperature for all your rack-mounted devices.
v Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not blocked
or reduced on any side, front, or back of a unit used for air flow through the unit.
v Consideration should be given to the connection of the equipment to the supply circuit so that
overloading of the circuits does not compromise the supply wiring or overcurrent protection. To
provide the correct power connection to a rack, refer to the rating labels located on the equipment in
the rack to determine the total power requirement of the supply circuit.
v (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets are
not attached to the rack. Do not pull out more than one drawer at a time. The rack might become
unstable if you pull out more than one drawer at a time.
v (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless specified
by the manufacturer. Attempting to move the drawer partially or completely out of the rack might
cause the rack to become unstable or cause the drawer to fall out of the rack.
CAUTION:
Removing components from the upper positions in the rack cabinet improves rack stability during
relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a
room or building.
v Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack
cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you
received it. If this configuration is not known, you must observe the following precautions:
– Remove all devices in the 32U position (compliance ID RACK-001) or 22U position (compliance ID RR001)
and above.
– Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
– Ensure that there are little-to-no empty U-levels between devices installed in the rack cabinet
below the 32U (compliance ID RACK-001) or 22U (compliance ID RR001) level, unless the
received configuration specifically allowed it.
v If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from
the suite.
v If the rack cabinet you are relocating was supplied with removable outriggers they must be
reinstalled before the cabinet is relocated.
v Inspect the route that you plan to take to eliminate potential hazards.
v Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the
documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
v Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
v Ensure that all devices, shelves, drawers, doors, and cables are secure.
v Ensure that the four leveling pads are raised to their highest position.
v Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
v Do not use a ramp inclined at more than 10 degrees.
v When the rack cabinet is in the new location, complete the following steps:
– Lower the four leveling pads.
– Install stabilizer brackets on the rack cabinet.
– If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest
position to the highest position.
v If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack
cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent.
Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the
pallet.
(R002)
(L001)
DANGER: Hazardous voltage, current, or energy levels are present inside any component that has this
label attached. Do not open any cover or barrier that contains this label. (L001)
(L002)
DANGER: Rack-mounted devices are not to be used as shelves or work spaces. (L002)
(L003)
DANGER: Multiple power cords. The product might be equipped with multiple AC power cords or
multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and power
cables. (L003)
(L007)
CAUTION: A hot surface nearby. (L007)
(L008)
CAUTION: Hazardous moving parts nearby. (L008)
All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class
1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser
product. Consult the label on each part for laser certification numbers and approval information.
CAUTION:
This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive,
DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:
v Do not remove the covers. Removing the covers of the laser product could result in exposure to
hazardous laser radiation. There are no serviceable parts inside the device.
v Use of the controls or adjustments or performance of procedures other than those specified herein
might result in hazardous radiation exposure.
(C026)
CAUTION:
Data processing environments can contain equipment transmitting on system links with laser modules
that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical
fiber cable or open receptacle. Although shining light into one end and looking into the other end of
a disconnected optical fiber to verify the continuity of optic fibers might not injure the eye, this
procedure is potentially dangerous. Therefore, verifying the continuity of optical fibers by shining
light into one end and looking at the other end is not recommended. To verify continuity of a fiber
optic cable, use an optical light source and power meter. (C027)
CAUTION:
This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)
CAUTION:
Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following
information: laser radiation when open. Do not stare into the beam, do not view directly with optical
instruments, and avoid direct exposure to the beam. (C030)
CAUTION:
The battery contains lithium. To avoid possible explosion, do not burn or charge the battery.
Do Not:
v Throw or immerse into water
v Heat to more than 100°C (212°F)
v Repair or disassemble
Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local
regulations. In the United States, IBM has a process for the collection of this battery. For information,
call 1-800-426-4333. Have the IBM part number for the battery unit available when you call. (C003)
CAUTION:
Regarding IBM provided VENDOR LIFT TOOL:
v Operation of LIFT TOOL by authorized personnel only.
v LIFT TOOL intended for use to assist, lift, install, remove units (load) up into rack elevations. It is
not to be used loaded transporting over major ramps nor as a replacement for such designated tools
like pallet jacks, walkies, fork trucks and such related relocation practices. When this is not
practicable, specially trained persons or services must be used (for instance, riggers or movers).
v Read and completely understand the contents of LIFT TOOL operator's manual before using.
Failure to read, understand, obey safety rules, and follow instructions may result in property
damage and/or personal injury. If there are questions, contact the vendor's service and support.
Local paper manual must remain with machine in provided storage sleeve area. Latest revision
manual available on vendor's web site.
v Test verify stabilizer brake function before each use. Do not over-force moving or rolling the LIFT
TOOL with stabilizer brake engaged.
v Do not move LIFT TOOL while platform is raised, except for minor positioning.
v Do not exceed rated load capacity. See LOAD CAPACITY CHART regarding maximum loads at
center versus edge of extended platform.
v Only raise load if properly centered on platform. Do not place more than 200 lb (91 kg) on edge of
sliding platform shelf also considering the load's center of mass/gravity (CoG).
v Do not corner load the platform tilt riser accessory option. Secure platform riser tilt option to main
shelf in all four (4x) locations with provided hardware only, prior to use. Load objects are designed
to slide on/off smooth platforms without appreciable force, so take care not to push or lean. Keep
riser tilt option flat at all times except for final minor adjustment when needed.
v Do not stand under overhanging load.
v Do not use on uneven surface, incline or decline (major ramps).
v Do not stack loads.
v Do not operate while under the influence of drugs or alcohol.
v Do not support ladder against LIFT TOOL.
v Tipping hazard. Do not push or lean against load with raised platform.
v Do not use as a personnel lifting platform or step. No riders.
v Do not stand on any part of lift. Not a step.
v Do not climb on mast.
v Do not operate a damaged or malfunctioning LIFT TOOL machine.
v Crush and pinch point hazard below platform. Only lower load in areas clear of personnel and
obstructions. Keep hands and feet clear during operation.
v No Forks. Never lift or move bare LIFT TOOL MACHINE with pallet truck, jack or fork lift.
v Mast extends higher than platform. Be aware of ceiling height, cable trays, sprinklers, lights, and
other overhead objects.
v Do not leave LIFT TOOL machine unattended with an elevated load.
v Watch and keep hands, fingers, and clothing clear when equipment is in motion.
v Turn Winch with hand power only. If winch handle cannot be cranked easily with one hand, it is
probably over-loaded. Do not continue to turn winch past top or bottom of platform travel.
Excessive unwinding will detach handle and damage cable. Always hold handle when lowering,
unwinding. Always assure self that winch is holding load before releasing winch handle.
v A winch accident could cause serious injury. Not for moving humans. Make certain clicking sound
is heard as the equipment is being raised. Be sure winch is locked in position before releasing
handle. Read instruction page before operating this winch. Never allow winch to unwind freely.
Freewheeling will cause uneven cable wrapping around winch drum, damage cable, and may cause
serious injury. (C048)
Power and cabling information for NEBS (Network Equipment-Building System)
GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS
(Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
v Network telecommunications facilities
v Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed
wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the
interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as
intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation
from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect
these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal
shall not be connected to the chassis or frame ground.
The dc-powered system is intended to be installed in a common bonding network (CBN) as described in
GR-1089-CORE.
Beginning troubleshooting and problem analysis
This information provides a starting point for analyzing problems.
This information is the starting point for diagnosing and repairing systems. From this point, you are
guided to the appropriate information to help you diagnose problems, determine the appropriate repair
action, and then complete the necessary steps to repair the system.
Note: Update the system firmware to the latest level before you start problem analysis. If you update the
system firmware, you will have the latest available fixes and improvements for error handling, reporting,
and isolation. For instructions about updating the system firmware, see Getting fixes.
What type of problem are you dealing with? | Problem analysis procedure
You do not know the type of problem. | Go to “Determining the problem analysis procedure to perform.”
A baseboard management controller (BMC) access problem occurred. | Go to “Resolving a BMC access problem” on page 2.
The system does not power on (the power button or the BMC power on command does not power on the system). | Go to “Resolving a power problem” on page 3.
A system firmware boot failure occurred (the system started but was not able to boot to the Petitboot menu). | Go to “Resolving a system firmware boot failure” on page 4.
A video graphics array (VGA) monitor problem occurred (the system started but video is not displayed on the monitor). | Go to “Resolving a VGA monitor problem” on page 8.
An operating system boot failure occurred (the system booted to the Petitboot menu but the operating system did not start). | Go to “Resolving an operating system boot failure” on page 9.
A BMC dashboard sensor is red. | Go to “Resolving a sensor indicator problem” on page 11.
A processor, memory, power, or cooling hardware failure occurred. | Go to “Resolving a hardware problem” on page 12.
Missing or faulty graphics processing unit (GPU), PCIe adapter, disk drive, or solid-state drive. | Go to Resolving a GPU, PCIe adapter, or device problem.
Determining the problem analysis procedure to perform
Learn how to identify the correct problem analysis procedure to perform.
To determine the correct problem analysis procedure to perform, complete the following steps:
1. After you apply power to the system, do the power supply LEDs display XXX and, after 30 seconds,
does the power button flash?
If      Then
Yes:    Continue with the next step.
No:     Go to “Resolving a power problem” on page 3.
2. Can you access the baseboard management controller (BMC) across the network?
If      Then
Yes:    Continue with the next step.
No:     Go to “Resolving a BMC access problem.”
3. Can you boot the system to the Petitboot menu?
If      Then
Yes:    Continue with the next step.
No:     Go to “Resolving a system firmware boot failure” on page 4.
4. Is video displayed on the video graphics array (VGA) monitor?
If      Then
Yes:    Continue with the next step.
No:     Go to “Resolving a VGA monitor problem” on page 8.
5. Can you start the operating system?
If      Then
Yes:    Continue with the next step.
No:     Go to “Resolving an operating system boot failure” on page 9.
6. On the BMC dashboard, are any sensors red?
If      Then
Yes:    Go to “Resolving a sensor indicator problem” on page 11.
No:     Continue with the next step.
7. Go to “Resolving a hardware problem” on page 12. This ends the procedure.
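The steps above form a simple first-failure-wins decision chain. The following Python sketch is one way to express that chain; it is illustrative only and not part of any IBM tool. The Observations type, its field names, and the choose_procedure function are hypothetical names introduced for this example, and the caller is assumed to have answered each question by inspecting the system.

```python
from typing import NamedTuple


class Observations(NamedTuple):
    """Answers to the questions in steps 1 - 6 (hypothetical field names)."""
    power_leds_ok: bool      # step 1: power supply LEDs and power button behave as expected
    bmc_reachable: bool      # step 2: BMC can be accessed across the network
    petitboot_reached: bool  # step 3: system boots to the Petitboot menu
    vga_video_shown: bool    # step 4: video is displayed on the VGA monitor
    os_started: bool         # step 5: operating system starts
    any_sensor_red: bool     # step 6: a BMC dashboard sensor is red


def choose_procedure(obs: Observations) -> str:
    """Return the name of the first procedure that the steps point to."""
    # Steps 1 - 5: a "No" answer sends you to the matching procedure.
    checks = [
        (obs.power_leds_ok, "Resolving a power problem"),
        (obs.bmc_reachable, "Resolving a BMC access problem"),
        (obs.petitboot_reached, "Resolving a system firmware boot failure"),
        (obs.vga_video_shown, "Resolving a VGA monitor problem"),
        (obs.os_started, "Resolving an operating system boot failure"),
    ]
    for passed, procedure in checks:
        if not passed:
            return procedure
    # Step 6: a "Yes" answer sends you to the sensor procedure.
    if obs.any_sensor_red:
        return "Resolving a sensor indicator problem"
    # Step 7: everything else is treated as a hardware problem.
    return "Resolving a hardware problem"
```

For example, a system that powers on and has a reachable BMC but never reaches the Petitboot menu maps to “Resolving a system firmware boot failure”, regardless of the later answers.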
Resolving a BMC access problem
Learn how to identify the service action that is needed to resolve a baseboard management controller
(BMC) access problem.
1. Ensure that the BMC password is not set to the default password. For information about changing the
default password, see Logging on to the BMC GUI. Does the problem persist?
If      Then
Yes:    Continue with the next step.
No:     This ends the procedure.
2. Are both ends of the network cable seated securely?
If      Then
Yes:    Continue with the next step.
No:     Seat both ends of the cable securely. If the problem persists, continue with the next step.
3. Power off the system and disconnect all ac power cords for 30 seconds. Then, reconnect the ac power
cords and power on the system. Does the BMC access problem persist?
If      Then
Yes:    Continue with the next step.
No:     This ends the procedure.
4. Verify that the BMC network settings are correct.
a. Power on the system by using the power button on the front of the system. Wait 1 - 2 minutes for
the system to display the Petitboot menu.
b. When the Petitboot menu is displayed, press any key to interrupt the boot process. Then, select
Exit to Shell.
c. Type the following command and press Enter:
ipmitool lan print 1
d. Verify that the MAC address and the IP address settings are correct. Then, continue with the next
step.
Note: If the IP address setting is incorrect, go to the Configuring the firmware IP address website
(http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwenablenetwork.htm).
If the MAC address is 00:00:00:00:00:00, go to “Contacting IBM service and support” on page 110.
5. Complete the following actions:
a. Power on to the Petitboot menu.
b. Use the BMC to update the system firmware. For instructions, see Updating the system firmware
by using the BMC.
Are you able to access the BMC?
If      Then
Yes:    This ends the procedure.
No:     Continue with the next step.
6. Complete the service action that is indicated for your system:
v If your system is an 8335-GCA or 8335-GTA, replace the system backplane. Go to “8335-GCA and
8335-GTA locations” on page 111 to identify the physical location and the removal and replacement
procedure. This ends the procedure.
v If your system is an 8335-GTB, replace the BMC card. Go to “8335-GTB locations” on page 121 to
identify the physical location and the removal and replacement procedure. This ends the
procedure.
v If your system is an 8348-21C, replace the system backplane. Go to “8348-21C locations” on page
133 to identify the physical location and the removal and replacement procedure. This ends the
procedure.
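The check in step 4 of this procedure (verifying the MAC address and IP address that ipmitool lan print 1 reports) can be scripted. The following Python sketch assumes that the command prints one "label : value" pair per line; that layout, the sample text, and the names parse_lan_print and check_bmc_lan are assumptions made for illustration, so compare them with the output on your own system before relying on them.

```python
ALL_ZERO_MAC = "00:00:00:00:00:00"

# Invented sample text in the assumed "label : value" layout.
SAMPLE_OUTPUT = """\
IP Address Source       : Static Address
IP Address              : 192.0.2.10
Subnet Mask             : 255.255.255.0
MAC Address             : 00:00:00:00:00:00
"""


def parse_lan_print(text: str) -> dict[str, str]:
    """Collect "label : value" lines into a dictionary."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so MAC addresses stay intact.
        label, sep, value = line.partition(":")
        if sep and label.strip():
            settings[label.strip()] = value.strip()
    return settings


def check_bmc_lan(settings: dict[str, str], expected_ip: str) -> str:
    """Map the parsed settings to the action named in step 4."""
    if settings.get("MAC Address") == ALL_ZERO_MAC:
        return "Contact IBM service and support"
    if settings.get("IP Address") != expected_ip:
        return "Configure the firmware IP address"
    return "Continue with the next step"
```

With the sample text above, check_bmc_lan(parse_lan_print(SAMPLE_OUTPUT), "192.0.2.10") reports that service should be contacted, because the MAC address is all zeros.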
Resolving a power problem
Learn how to identify the service action that is needed to resolve a power problem.
1. Is the amber LED of a power supply on solid and is the amber LED on the front of the system turned
off?
If      Then
Yes:    Ensure that the power cords for both power supplies are fully seated and that the power
        distribution units (PDUs) and power outlets are supplying electricity. This ends the
        procedure.
No:     Continue with the next step.
2. Are the power supply LEDs turned off?
If      Then
Yes:    Continue with the next step.
No:     Continue with step 4.
3. Perform the following actions, one at a time, until the problem is resolved:
a. Ensure that all of the power cords are fully seated in the power supplies.
b. Ensure that all of the power cords are fully seated in the power distribution units (PDUs) or wall
outlets.
c. If the power cords are plugged into PDUs, ensure that the PDUs are turned on.
d. Ensure that all of the power cords are plugged into PDUs or wall outlets that are supplying
electricity.
e. Replace the power cords.
f. Replace the power supplies.
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page
111 to identify the physical location and the removal and replacement procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
4. Is the amber LED of a power supply on solid and is the red LED on the front of the system flashing
at 0.25 Hz?
If      Then
Yes:    Continue with the next step.
No:     Go to “Contacting IBM service and support” on page 110. This ends the procedure.
5. Perform the following actions, one at a time, until the problem is resolved:
a. Ensure that the power supply is fully seated in the system.
b. Ensure that the power supply fan is not blocked.
c. Replace the power supply.
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page
111 to identify the physical location and the removal and replacement procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
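The LED checks in this procedure reduce to three questions that are asked in order. The following Python sketch encodes that order as a rough aid for reading the procedure; the function name, parameter names, and action strings are hypothetical, and the LED states must still be observed by a person at the system.

```python
def power_problem_actions(
    psu_amber_solid: bool,
    front_amber_off: bool,
    psu_leds_off: bool,
    front_red_flashing: bool,
) -> list[str]:
    """Return the ordered actions that steps 1 - 5 point to."""
    # Step 1: power supply amber LED solid and front amber LED off.
    if psu_amber_solid and front_amber_off:
        return ["Check that both power cords are seated and that the PDUs and outlets supply electricity"]
    # Steps 2 and 3: all power supply LEDs are off.
    if psu_leds_off:
        return [
            "Seat the power cords in the power supplies",
            "Seat the power cords in the PDUs or wall outlets",
            "Turn on the PDUs",
            "Confirm that the PDUs or wall outlets supply electricity",
            "Replace the power cords",
            "Replace the power supplies",
        ]
    # Steps 4 and 5: power supply amber LED solid and front red LED flashing at 0.25 Hz.
    if psu_amber_solid and front_red_flashing:
        return [
            "Seat the power supply in the system",
            "Confirm that the power supply fan is not blocked",
            "Replace the power supply",
        ]
    # Step 4, "No" branch.
    return ["Contact IBM service and support"]
```

Each returned list is meant to be worked through one action at a time, stopping as soon as the problem is resolved, which mirrors the wording of steps 3 and 5.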
Resolving a system firmware boot failure
Learn how to identify the service action that is needed to resolve a failure while booting your system
firmware.
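Several steps in the procedure that follows run ipmitool against the BMC over the LAN and then inspect the output. The following Python sketch shows one way to assemble those command lines and to interpret the records that the golden-side check returns. The function names, the host name, and the sample sensor lines are invented for illustration, and the pipe-separated layout of the sensor records is an assumption; only the ipmitool options and the 0x0080 and 0x0180 values come from this procedure.

```python
def remote_ipmitool_args(host: str, username: str, password: str, *subcommand: str) -> list[str]:
    """Build the argument list for a remote ipmitool call over lanplus."""
    return ["ipmitool", "-I", "lanplus", "-U", username, "-P", password, "-H", host, *subcommand]


def classify_boot_side(sensor_output: str) -> str:
    """Interpret the records that mention "golden" in sensor list output.

    Returns "golden" if any record has 0x0180 in a data field, "primary"
    if every record has 0x0080, and "unknown" otherwise.
    """
    records = [line for line in sensor_output.splitlines() if "golden" in line.lower()]
    # Assumed layout: fields separated by "|" with padding around each value.
    fields = [[field.strip() for field in line.split("|")] for line in records]
    if any("0x0180" in record for record in fields):
        return "golden"
    if records and all("0x0080" in record for record in fields):
        return "primary"
    return "unknown"


# Invented sample records; real sensor names and columns may differ.
SAMPLE_PRIMARY = "BIOS Golden Side | 0x0 | discrete | 0x0080| na\nBMC Golden Side | 0x0 | discrete | 0x0080| na\n"
SAMPLE_GOLDEN = "BIOS Golden Side | 0x0 | discrete | 0x0180| na\nBMC Golden Side | 0x0 | discrete | 0x0080| na\n"
```

The argument list can be passed to subprocess.run, for example subprocess.run(remote_ipmitool_args("bmc.example.com", "user", "secret", "chassis", "status"), capture_output=True, text=True). A result of "primary" corresponds to the case where the error was temporary, and "golden" corresponds to a boot from the manufacturing level of the firmware image.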
1. After you pressed the power button, did the system turn on but fail to display the Petitboot menu?
If      Then
Yes:    Continue with the next step.
No:     Continue with step 5.
2. Does the baseboard management controller (BMC) respond to commands?
Note: To determine whether the BMC responds to commands, run the following ipmitool command:
ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> chassis status
If      Then
Yes:    Continue with the next step.
No:     Continue with step 4.
3. Complete the following actions:
a. Use the BMC to update the system firmware. For instructions, see Updating the system firmware
by using the BMC.
b. Check the system event logs. For instructions, see “Identifying a service action by using system
event logs” on page 27. Then, continue with step 5.
4. Complete the following actions, one at a time, until the problem is resolved:
a. Reset the BMC remotely by entering the following command:
ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> mc reset cold
b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5
minutes, and then go to step 2.
c. Use the IPMI tool to update the system firmware. For instructions, see Updating the system
firmware by using the IPMI tool.
d. Complete the service action that is indicated for your system:
v If your system is an 8335-GCA or 8335-GTA, replace the system backplane. Go to “8335-GCA
and 8335-GTA locations” on page 111 to identify the physical location and the removal and
replacement procedure.
v If your system is an 8335-GTB, replace the BMC card. Go to “8335-GTB locations” on page 121
to identify the physical location and the removal and replacement procedure.
v If your system is an 8348-21C, replace the system backplane. Go to “8348-21C locations” on
page 133 to identify the physical location and the removal and replacement procedure.
This ends the procedure.
5. Are you here because of a system event log (SEL) with the value OEM record c0 and OEM c0
specific log information 3a1503xxxxxx?
If      Then
Yes:    Continue with step 8 on page 6.
No:     Continue with the next step.
6. Are you here because of a SEL event with the value OEM record c0 and OEM c0 specific log
information 3a1504xxxxxx?
If      Then
Yes:    Continue with step 12 on page 7.
No:     Continue with the next step.
7. Power off the system and disconnect all ac power cords for 30 seconds. Then, reconnect the ac
power cords and power on the system. Does the system boot successfully?
If      Then
Yes:    This ends the procedure.
No:     Go to “Resolving a hardware problem” on page 12. This ends the procedure.
8. Did the system complete the boot process successfully?
If      Then
Yes:    Continue with the next step.
No:     Continue with step 12 on page 7.
9. Determine whether the system is booted from the user-updated level of the system firmware image
(primary side) or the manufacturing level of the system firmware image (golden side).
v For in-band networks, enter the following command:
ipmitool sensor list | grep -i golden
v To run the command remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sensor list | grep -i golden
Do both of the returned records show 0x0080 in the data fields?
v Yes: The error was temporary. No service action is required. This ends the procedure.
v No: One or both of the returned records have 0x0180 in the data fields. The system was booted
  from the golden side. Continue with the next step.
10. Search for processor deconfiguration SEL events that have a time stamp in close proximity to the
time stamp of the event with value OEM record c0 that sent you here. Processor deconfiguration
SEL events are displayed in the following form:
v Processor CPU Func x | Transition to Non-recoverable | Asserted
Are processor deconfiguration events present?
v Yes: Complete the service actions for the processor deconfiguration events.
  v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using
    sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends
    the procedure.
  v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and
    event information for the 8335-GTB” on page 57. This ends the procedure.
  v If your system is an 8348-21C, go to “Identifying a service action by using sensor and
    event information for the 8348-21C” on page 78. This ends the procedure.
v No: Continue with the next step.
11. Are there other types of SEL events that require a service action and have a time stamp in close
proximity to the time stamp of the event with value OEM record c0 that sent you here?
v Yes: Complete the service actions for the SEL events that require service actions.
  v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using
    sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends
    the procedure.
  v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and
    event information for the 8335-GTB” on page 57. This ends the procedure.
  v If your system is an 8348-21C, go to “Identifying a service action by using sensor and
    event information for the 8348-21C” on page 78. This ends the procedure.
v No: If the boot problem persists, reload or update the system firmware image. Go to Getting
  fixes and reload the system firmware with the same level of firmware or update the system
  firmware with a more recent level of firmware. Then, reboot the system. This ends the
  procedure.
12. Search for processor deconfiguration SEL events that have a time stamp in close proximity to the
time stamp of the event with value OEM record c0 that sent you here. Processor deconfiguration
SEL events are displayed in the following form:
v Processor CPU Func x | Transition to Non-recoverable | Asserted
Are processor deconfiguration events present?
v Yes: Complete the service actions for the processor deconfiguration events.
  v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using
    sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends
    the procedure.
  v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and
    event information for the 8335-GTB” on page 57. This ends the procedure.
  v If your system is an 8348-21C, go to “Identifying a service action by using sensor and
    event information for the 8348-21C” on page 78. This ends the procedure.
v No: Continue with the next step.
13. Are there other types of SEL events that require a service action and have a time stamp in close
proximity to the time stamp of the event with value OEM record c0 that sent you here?
v Yes: Complete the service actions for the SEL events that require service actions.
  v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using
    sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends
    the procedure.
  v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and
    event information for the 8335-GTB” on page 57. This ends the procedure.
  v If your system is an 8348-21C, go to “Identifying a service action by using sensor and
    event information for the 8348-21C” on page 78. This ends the procedure.
v No: Continue with the next step.
14. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC
power cords and power on the system. Does the system boot successfully?
v Yes: This ends the procedure.
v No: Continue with the next step.
15. Is the system an 8348-21C, and are all 32 of the DIMM locations populated with 32 GB DIMMs?
v Yes: Continue with the next step.
v No: Go to step 18.
16. Use the baseboard management controller (BMC) to update the system firmware. For instructions,
see Updating the system firmware by using the BMC. Does the problem persist?
v Yes: Continue with the next step.
v No: This ends the procedure.
17. Is your system an 8335-GTB?
v Yes: Replace the baseboard management controller (BMC) card. Go to “8335-GTB locations” on
  page 121 to identify the physical location and the removal and replacement procedure. If
  the problem persists, continue with the next step. Otherwise, this ends the procedure.
v No: Continue with the next step.
18. Replace the system backplane.
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page
111 to identify the physical location and the removal and replacement procedure. Then, continue
with the next step.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure. Then, continue with the next step.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure. Then, continue with the next step.
19. Does the problem persist?
v Yes: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and
  support” on page 110. This ends the procedure.
v No: This ends the procedure.
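The checks in steps 9, 10, and 12 can be sketched in Python as follows. This is a minimal illustration, not an IBM-supplied tool. It assumes that the filtered sensor records are plain text lines that contain 0x0080 or 0x0180, and that standard SEL rows use the layout record ID | date | time | sensor | event | state with MM/DD/YYYY dates. The function names and the 5-minute window are hypothetical; this procedure does not define how close "close proximity" is.

```python
from datetime import datetime, timedelta


def booted_from_golden(sensor_lines):
    """Classify the records returned by the 'grep -i golden' filter.

    Returns True if any record shows 0x0180 (golden side), False if every
    record shows 0x0080 (primary side), and None if neither pattern fits.
    """
    records = [line for line in sensor_lines if line.strip()]
    if any("0x0180" in line for line in records):
        return True
    if records and all("0x0080" in line for line in records):
        return False
    return None


def processor_deconfig_events(sel_lines, reference_time, window_minutes=5):
    """Return asserted processor deconfiguration rows near reference_time.

    reference_time is the time stamp of the OEM record c0 event, read from
    the log by the user. window_minutes is an assumed value.
    """
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in sel_lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue  # not a standard six-column SEL row
        try:
            stamp = datetime.strptime(
                fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S"
            )
        except ValueError:
            continue  # rows without a parsable time stamp are skipped
        if (
            fields[3].startswith("Processor CPU Func")
            and fields[4] == "Transition to Non-recoverable"
            and fields[5] == "Asserted"
            and abs(stamp - reference_time) <= window
        ):
            matches.append(line)
    return matches
```

A result of None from booted_from_golden means that the records did not match either value, in which case read the data fields manually.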
Resolving a VGA monitor problem
Learn how to identify the service action that is needed to resolve a video graphics array (VGA) monitor
problem.
1. Is the system powered on and is the VGA monitor connected to the VGA display port, but video is
not displayed?
v Yes: Continue with the next step.
v No: This ends the procedure.
2. Complete the following steps, one at a time, until the problem is resolved:
a. Ensure that the VGA cable is properly seated in the server port and in the monitor port.
b. Verify that the monitor and the VGA cable are working properly by testing them on a system that
is known to be working properly. If the monitor or the VGA cable does not work properly, replace
it.
c. Verify that the system is powered on by activating a serial over LAN (SOL) session through the
baseboard management controller (BMC). If the system is not active, go to “Resolving a system
firmware boot failure” on page 4.
d. Replace the system backplane.
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page
111 to identify the physical location and the removal and replacement procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure.
This ends the procedure.
Resolving an operating system boot failure
Learn how to identify the service action that is needed to resolve a failure while booting your operating
system.
1. Was the system recently installed, serviced, moved, or upgraded?
v Yes: Ensure that all cables are properly seated in the connection path to the designated boot
  device. This ends the procedure.
v No: Continue with the next step.
2. Are you booting the operating system from a network location?
v Yes: Continue with the next step.
v No: Continue with step 4.
3. Complete the following actions, one at a time, until the problem is resolved:
a. Ensure that a problem does not exist with the connection to the network location.
b. Ensure that the adapter has a valid IP address for the network.
c. Replace the network adapter.
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on
page 111 to identify the physical location and the removal and replacement procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure.
4. By default, Petitboot displays all recognized bootable images. Is the boot image recognized by
Petitboot?
v Yes: Continue with step 11 on page 11.
v No: Select the Petitboot menu option to refresh the boot images. If the problem persists,
  continue with the next step.
5. Is the system an 8348-21C, and is the boot image on a storage device that is configured in a RAID
configuration?
v Yes: Continue with the next step.
v No: Continue with step 11 on page 11.
6. On the Petitboot command line, type the following command:
arcconf getconfig 1 LD
Is the logical boot drive recognized and in optimal status?
v Yes: Reinstall the operating system on the logical drive. This ends the procedure.
v No: Continue with the next step.
7. Are the drives properly seated in their respective drive bays?
Note:
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page
111 to identify the physical location and the removal and replacement procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure.
v Yes: Continue with the next step.
v No: Properly seat the drives in the drive bays. Then, go to step 4 on page 9.
8. Refresh the Petitboot boot options. Is the boot image on the logical drive recognized?
v Yes: Boot the operating system. Then, continue with step 11 on page 11.
v No: Continue with the next step.
9. Verify that the physical drives are in the RAID array. On the Petitboot command line, type the
following command:
arcconf getconfig 1 PD
Are the physical drives that are known to be in the RAID array recognized?
v Yes: Reinstall the operating system on the logical drive. This ends the procedure.
v No: Continue with the next step.
10. Complete the following actions, one at a time, until the physical drives are recognized in the RAID
array:
Note:
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page
111 to identify the physical location and the removal and replacement procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical
location and the removal and replacement procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical
location and the removal and replacement procedure.
a. Ensure that the SAS cable is securely seated in the RAID adapter and the storage backplane.
b. Replace the RAID adapter.
c. Replace the SAS cable.
This ends the procedure.
11. Does an operating system error occur during the boot?
v Yes: Recover the operating system with the tools provided for the operating system. If that does
  not resolve the problem, reinstall the operating system. This ends the procedure.
v No: Reinstall the operating system. This ends the procedure.
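The status check in step 6 can be sketched in Python as follows. This is a rough illustration only. It assumes that the output of arcconf getconfig 1 LD contains a "Logical device number N" heading followed by a "Status of logical device : <status>" line for each logical drive; verify that layout against the output of your arcconf version before relying on it. The function names are hypothetical.

```python
import re


def logical_device_statuses(getconfig_output):
    """Map each logical device number to its reported status string.

    The heading and status-line patterns are assumptions about the
    arcconf output layout, matched without regard to case.
    """
    statuses = {}
    current = None
    for line in getconfig_output.splitlines():
        header = re.match(r"\s*Logical device number\s+(\d+)", line, re.IGNORECASE)
        if header:
            current = int(header.group(1))
            continue
        status = re.match(r"\s*Status of logical device\s*:\s*(.+)", line, re.IGNORECASE)
        if status and current is not None:
            statuses[current] = status.group(1).strip()
    return statuses


def boot_drive_is_optimal(getconfig_output, device_number=0):
    """Return True only if the given logical device reports Optimal."""
    status = logical_device_statuses(getconfig_output).get(device_number, "")
    return status.lower() == "optimal"
```

If boot_drive_is_optimal returns False, or the logical drive is missing from the mapping, the answer to step 6 is No.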
Resolving a sensor indicator problem
Learn how to resolve a sensor indicator problem by using the BMC dashboard.
After the system is powered on, some sensors retain their status from the last time the system was
operational. As a result, the sensor indicator LED might not reflect the status of the physical sensor, and
it can be unclear whether the sensor indicator LED indicates an actual problem that requires a service
action. For more information about BMC dashboard sensors on an 8335-GCA or 8335-GTA, see Event
sensor status GUI display. For more information about BMC dashboard sensors on an 8335-GTB, see
Event sensor status GUI display. For more information about BMC dashboard sensors on an 8348-21C,
see Event sensor status GUI display.
To refresh the sensor indicator LEDs and to determine whether a service action is required, complete the
following procedure:
1. Power off the system. Then, boot the system to the operational state. Click Refresh on the BMC
dashboard.
Are any of the sensor indicator LEDs still red?
v Yes: Continue with the next step.
v No: This ends the procedure.
2. Record the names of any sensors that have a red LED indicator status.
Note: Repeat steps 3 - 6 for every sensor that you record in this step.
3. Use one of the following commands to list the system event logs (SELs).
v To list SELs by using an in-band network, enter the following command:
ipmitool sel elist
v To list SELs remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
4. Review the list of SELs and locate the log entry that meets the following criteria:
v The name of any of the sensors you recorded in step 2.
v A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 36.
v Asserted is in the description.
Did you identify a log entry that meets the above criteria?
v Yes: Continue with the next step.
v No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and
support” on page 110. This ends the procedure.
5. Use one of the following options to display the SEL details for the sensor:
Note: You must specify the SEL record ID in hexadecimal format. For example: 0x1a.
v To display SEL details by using an in-band network, enter the following command:
ipmitool sel get <SEL record ID>
v To display SEL details remotely over the LAN, enter the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
6. The sensor ID field contains sensor information in the sensor name (sensor ID) format. Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and
event information for the 8335-GCA and 8335-GTA” on page 37 to determine the service action to
perform. This ends the procedure.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57 to determine the service action to perform. This ends the
procedure.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78 to determine the service action to perform. This ends the
procedure.
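Steps 3 to 5 can be sketched in Python as a filter over the listed SEL rows. This is a minimal illustration, not an IBM-supplied tool. It assumes that each row uses the layout record ID | date | time | sensor | event | state, and that the listed record ID is a hexadecimal number without the 0x prefix. The function name is hypothetical, and the keyword list must be taken from “Identifying service action keywords in system event logs”; none is built in.

```python
def matching_sel_records(sel_lines, sensor_names, keywords):
    """Return (record ID, row) pairs that meet the criteria in step 4.

    A row qualifies when it names one of the recorded sensors, contains a
    service action keyword, and has the state Asserted. The record ID is
    returned in the 0x form that 'ipmitool sel get' expects.
    """
    results = []
    for line in sel_lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue  # not a standard six-column SEL row
        try:
            record_number = int(fields[0], 16)  # assumed hexadecimal ID column
        except ValueError:
            continue
        names_sensor = any(name.lower() in fields[3].lower() for name in sensor_names)
        has_keyword = any(word.lower() in line.lower() for word in keywords)
        if names_sensor and has_keyword and fields[5] == "Asserted":
            results.append((hex(record_number), line))
    return results
```

Each returned record ID can be passed as the <SEL record ID> value in step 5.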
Resolving a hardware problem
Learn how to identify the service action that is needed to resolve a hardware problem.
1. If you have not already done so, manually boot the system.
2. Go to “Identifying a service action by using system event logs” on page 27. Then, continue with the
next step.
3. Was a service action identified?
v Yes: Continue with the next step.
v No: Go to step 5.
4. Did the service action fix the problem?
v Yes: This ends the procedure.
v No: Go to step 5.
5. Go to “Resolving a GPU, PCIe adapter, or device problem” on page 13. Then, continue with the next
step.
6. Was a service action identified?
v Yes: Continue with the next step.
v No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and
  support” on page 110. This ends the procedure.
7. Did the service action fix the problem?
v Yes: This ends the procedure.
v No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and
  support” on page 110. This ends the procedure.
Resolving a GPU, PCIe adapter, or device problem
Learn how to access log files, identify types of events, and review the potential problems and
service actions.
1. Are all of the adapters in the system missing or failed?
v Yes: Replace the system backplane.
  v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations”
    on page 111 to identify the physical location and the removal and replacement procedure.
  v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the
    physical location and the removal and replacement procedure.
  v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the
    physical location and the removal and replacement procedure.
v No: Continue with the next step.
2. To identify the correct service procedure to perform by using operating system log information,
complete the following steps:
a. Log in as the root user.
b. At the command prompt, type dmesg and press Enter.
3. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed.
When you find a keyword that accompanies one or more of the resource names in the following table,
a service action is required. Use the following table to determine the service procedure to perform for
your type of problem.
Table 1. Resource names, examples, and service procedures for different types of operating system logs.

Resource name: aacraid
Example of a log requiring a service action: PCI error detected 2
Type of problem: RAID
Note: This adapter is available only for 8348-21C systems.
Service procedure: Go to “Resolving a RAID adapter problem” on page 14.

Resource name: eth1, eth2, eth3
Example of a log requiring a service action: Failed to re-initialize device
Type of problem: Network
Service procedure: Go to “Resolving a network adapter problem” on page 15.

Resource name: NVRM
Example of a log requiring a service action: aborting RmInitAdapter failed!
Type of problem: Graphics
Service procedure: Go to “Resolving a graphics processing unit problem” on page 16.

Resource name: nvidia-nvlink
Example of a log requiring a service action: IBMNPU: NPU FENCE detected, machine power cycle required
Type of problem: Graphics
Service procedure: Go to “Resolving a graphics processing unit problem” on page 16.

Resource name: nvme
Example of a log requiring a service action: Failed status: ffffffff, reset controller
Type of problem: NVMe Flash adapter
Note: This adapter is available only for 8335-GCA systems.
Service procedure: Go to “Resolving an NVMe Flash adapter problem” on page 19.

Resource name: ata1, ata2
Example of a log requiring a service action: SError: { RecovComm PHYRdyChg 10B8B Dispar }
Type of problem: Marvell storage adapter
Note: This adapter is available only for 8348-21C systems.
Service procedure: Go to “Resolving a storage device problem” on page 20.

Resource name: sda, sdb, sdc
Example of a log requiring a service action: FAILED Result
Type of problem: Storage
Service procedure: Go to “Resolving a storage device problem” on page 20.
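The scan in step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the resource patterns are loose regular-expression readings of the names in Table 1 (for example, any ethN interface rather than only eth1 to eth3), only the keywords fail, failure, and failed are checked, and the function name is hypothetical. Table 1 also lists example messages that contain none of these keywords, so a None result does not prove that no service action is required.

```python
import re

# (resource pattern, type of problem, service procedure), following Table 1.
RESOURCE_TABLE = [
    (re.compile(r"\baacraid\b"), "RAID", "Resolving a RAID adapter problem"),
    (re.compile(r"\beth\d+\b"), "Network", "Resolving a network adapter problem"),
    (re.compile(r"\bNVRM\b"), "Graphics", "Resolving a graphics processing unit problem"),
    (re.compile(r"\bnvidia-nvlink\b"), "Graphics", "Resolving a graphics processing unit problem"),
    (re.compile(r"\bnvme\d*\b"), "NVMe Flash adapter", "Resolving an NVMe Flash adapter problem"),
    (re.compile(r"\bata\d+\b"), "Marvell storage adapter", "Resolving a storage device problem"),
    (re.compile(r"\bsd[a-z]\b"), "Storage", "Resolving a storage device problem"),
]

# One case-insensitive pattern covers fail, failure, and failed.
KEYWORD = re.compile(r"fail", re.IGNORECASE)


def first_service_hit(dmesg_lines):
    """Return (type of problem, service procedure, line) for the first log
    line that pairs a keyword with a resource name, or None if none does."""
    for line in dmesg_lines:
        if not KEYWORD.search(line):
            continue
        for pattern, problem_type, procedure in RESOURCE_TABLE:
            if pattern.search(line):
                return problem_type, procedure, line
    return None
```

The input can be the saved dmesg output split into lines, for example with the splitlines method of a Python string.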
Resolving a RAID adapter problem
Learn about the possible problems and service actions that you can perform to resolve a RAID adapter
problem.
Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by
using the slot number” on page 21.
Table 2. RAID adapter problems and service actions.

Problem: System unable to find adapter
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install the
   most recent firmware.
5. Restart the system.
6. Replace the adapter.
7. Replace the system backplane.
8. Replace the central processing unit (CPU).

Problem: Adapter stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter
   is seated properly and all associated cables are connected correctly.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently
   added one or more new adapters, remove them and then test to determine whether the failing
   adapter is functioning properly again. If the RAID adapter is functioning again, review the
   IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts.
   Then, reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. Replace the system backplane.
7. Replace the CPU.

Problem: One or more drives are not recognized
Service action:
1. If more than one drive is not recognized, verify that the cables are properly attached to
   the RAID card.
2. Verify that the drive or drives are fully seated in the system.
3. Verify that all of the cables that attach to the backplane are properly seated.
4. Verify that the drive or drives are compatible with the RAID adapter.
5. Verify that the most recent firmware is installed for the RAID adapter. If it is not, install
   the most recent firmware.
6. If more than one drive is not recognized, replace the drive.
7. Replace the RAID adapter.
8. Replace the system backplane.
9. Replace the cable or cables.

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter
user information, see “User guides for GPUs and PCIe adapters” on page 25.
Resolving a network adapter problem
Learn about the possible problems and service actions that you can perform to resolve a network adapter
problem.
Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by
using the slot number” on page 21.
Table 3. Network adapter problems and service actions.

Problem: System unable to find adapter
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install the
   most recent firmware.
5. Restart the system.
6. Replace the adapter.
7. Replace the system backplane.
8. Replace the central processing unit (CPU).

Problem: Adapter stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter
   is seated properly and all associated cables are correctly connected.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently
   added one or more new adapters, remove them and then test to determine whether the failing
   adapter is functioning properly again. If the network adapter is functioning again, review the
   IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts.
   Then, reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. Replace the system backplane.
7. Replace the CPU.

Problem: Link indicator light on the adapter is off
Service action:
1. Verify that the cable functions properly by testing it with a known working connection.
2. Verify that the port or ports on the switch are enabled and functional.
3. Verify that the switch and adapter are compatible.
4. Replace the adapter.

Problem: Link light on the adapter is on, but there is no communication from the adapter
Service action:
1. Verify that the most recent driver is installed. If it is not, install the most recent driver.
2. Verify that the adapter and its link have compatible settings, such as speed and duplex
   configuration.

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter
user information, see “User guides for GPUs and PCIe adapters” on page 25.
Resolving a graphics processing unit problem
Learn about the possible problems and service actions that you can perform to resolve a graphics
processing unit (GPU) problem.
Note: To determine the location of the GPU, see “Identifying the location of the GPU” on page 22.
Table 4. GPU problems and service actions for the 8335-GCA or 8335-GTA
Problem: System unable to find GPU
Service action:
1. Verify that the GPU is properly seated in a
compatible slot.
2. Install the GPU in a different compatible slot.
3. Verify that the drivers for the GPU are installed.
4. Verify that the most recent firmware is installed on
the system. If it is not, install the most recent
firmware.
5. Restart the system.
6. If the GPU is still missing, replace the following
items, one at a time, until the problem is resolved:
Note: Go to “8335-GCA and 8335-GTA locations” on
page 111 to identify the physical location and the
removal and replacement procedure.
a. GPU
b. System processor modules
c. System backplane
Problem: GPU stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced,
or upgraded, verify that the GPU is seated properly
and all associated cables are connected correctly.
2. Inspect the PCIe socket and verify that there is no
dirt or debris in the socket.
3. Inspect the card and verify that it is not physically
damaged.
4. Verify that all cables are properly seated and are not
physically damaged. If you recently added one or
more new adapters, remove them and then test to
determine whether the failing adapter is functioning
properly again. If the graphics adapter is functioning
again, review the IBM support tips to confirm that
there are no PCI address, driver, or firmware
conflicts. Then, reinstall the new adapters again one
at a time until all adapters function properly.
5. If the GPU is still not working, replace the following
items, one at a time, until the problem is resolved:
Note: Go to “8335-GCA and 8335-GTA locations” on
page 111 to identify the physical location and the
removal and replacement procedure.
a. GPU
b. System processor modules
c. System backplane
Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter
user information, see “User guides for GPUs and PCIe adapters” on page 25.
Table 5. GPU problems and service actions for the 8335-GTB

Problem: System unable to find GPU
Service action:
1. Verify that the GPU is properly seated.
2. Verify that the drivers for the GPU are installed.
3. Verify that the most recent firmware is installed on the system. If it is not, install the
   most recent firmware.
4. Restart the system.
5. If the GPU is still missing, replace the following items, one at a time, until the problem is
   resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the
   removal and replacement procedure.
   a. GPU
   b. System processor modules
   c. System backplane

Problem: Fence errors in the operating system log
Service action:
1. Restart the system. Do fence errors continue to be logged in the operating system log?
   v Yes: Continue with the next step.
   v No: This ends the procedure.
2. Does NPU chip 0 appear in the fence error log entry?
   v Yes: Continue with the next step.
   v No: Go to step 4.
3. Replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the
   removal and replacement procedure.
   a. CPU 1
   b. GPU 2
   c. GPU 1
   d. System backplane
   This ends the procedure.
4. Does NPU chip 1 appear in the fence error log entry?
   v Yes: Continue with the next step.
   v No: Go to “Contacting IBM service and support” on page 110. This ends the procedure.
5. Replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the
   removal and replacement procedure.
   a. CPU 2
   b. GPU 4
   c. GPU 3
   d. System backplane
   This ends the procedure.

Problem: GPU stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the GPU is
   seated properly.
2. Inspect the GPU and verify that it is not physically damaged.
3. If the GPU is still not working, replace the following items, one at a time, until the
   problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the
   removal and replacement procedure.
   a. GPU
   b. System processor modules
   c. System backplane

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter
user information, see “User guides for GPUs and PCIe adapters” on page 25.
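The fence error service action for the 8335-GTB reduces to a lookup from the NPU chip number to an ordered replacement list, which can be sketched in Python as follows. This is an illustration only. The names are hypothetical, and the chip number is read from the fence error log entry by the user because the exact wording of that entry is not assumed here.

```python
# Replacement order per NPU chip, copied from the fence error service action.
FENCE_REPLACEMENT_ORDER = {
    0: ["CPU 1", "GPU 2", "GPU 1", "System backplane"],
    1: ["CPU 2", "GPU 4", "GPU 3", "System backplane"],
}


def next_fence_replacement(npu_chip, already_replaced=()):
    """Return the next item to replace for the given NPU chip.

    Returns None when the chip number is neither 0 nor 1 (contact IBM
    service and support) or when every listed item was already replaced.
    """
    order = FENCE_REPLACEMENT_ORDER.get(npu_chip)
    if order is None:
        return None
    for item in order:
        if item not in already_replaced:
            return item
    return None
```

Replace one item at a time and test the system after each replacement, as the service action requires.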
Resolving an NVMe Flash adapter problem
Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile
Memory Express (NVMe) Flash adapter problem.
If you suspect a problem with a PCIe3 1.92 TB CAPI NVMe Flash accelerator adapter (FC EJ1K; CCIN
58CD), see PCIe3 1.92 TB CAPI NVMe Flash Accelerator Adapter (FC EJ1K; CCIN 58CD).
If you suspect a problem with an NVMe Flash adapter, use the following table to determine the service
action to perform.
Note: To determine the location of the NVMe Flash adapter, see “Identifying the location of the NVMe
Flash adapter” on page 23.
Table 6. NVMe Flash adapter problems and service actions
Problem: System is unable to find the NVMe Flash adapter
Service action:
1. If the NVMe Flash adapter has an amber LED that is flashing or is on solid, replace the
adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical
location and removal and replacement procedure.
Important: Before you remove an NVMe Flash adapter, ensure that you back up all data
on the adapter or the array that contains the adapter. After you replace the adapter,
restore the data.
2. If the system was recently installed, moved, serviced, or upgraded, verify that the NVMe
Flash adapter is seated and installed properly.
3. Verify that the NVMe Flash adapter is compatible with the system.
4. Verify that the most recent firmware is installed on the system. If it is not, install the
most recent firmware.
5. Replace the NVMe Flash adapter. Go to “8335-GCA and 8335-GTA locations” on page 111
to identify the physical location and removal and replacement procedure.
Important: Before you remove an NVMe Flash adapter, ensure that you back up all data
on the adapter or the array that contains the adapter. After you replace the adapter,
restore the data.
Problem: NVMe Flash adapter stops working suddenly
Service action:
1. If the NVMe Flash adapter has an amber LED that is flashing or is on solid, replace the
adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical
location and removal and replacement procedure.
Important: Before you remove an NVMe Flash adapter, ensure that you back up all data
on the adapter or the array that contains the adapter. After you replace the adapter,
restore the data.
2. Check the system logs to verify whether the system detected a problem.
3. Replace the NVMe Flash adapter. Go to “8335-GCA and 8335-GTA locations” on page 111
to identify the physical location and removal and replacement procedure.
Important: Before you remove an NVMe Flash adapter, ensure that you back up all data
on the adapter or the array that contains the adapter. After you replace the adapter,
restore the data.

Problem: Maximum write capability of an NVMe Flash adapter is depleted
Service action: To determine whether the maximum write capability of a PCIe3 1.6 TB NVMe Flash
adapter is depleted, see PCIe3 1.6 TB NVMe Flash adapter (FC EC54; CCIN 58CB). To determine
whether the maximum write capability of a PCIe3 3.2 TB NVMe Flash adapter is depleted,
see PCIe3 3.2 TB NVMe Flash adapter (FC EC56; CCIN 58CC). If you determine that the
adapter must be replaced, go to “8335-GCA and 8335-GTA locations” on page 111 to identify
the physical location and removal and replacement procedure.
Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on
the adapter or the array that contains the adapter. After you replace the adapter, restore the
data.

Problem: Other problems
Service action:
1. Check for and resolve any nvmeX entries in the operating system log, where nvmeX is
the resource name of the NVMe Flash adapter. Then, test the NVMe Flash adapter again.
2. Ensure that the latest I/O adapter firmware is installed. For instructions, see Getting
firmware fixes for IBM I/O adapters by using Fix Central.
3. Ensure that you have the latest device driver service updates by installing the latest Linux
distribution fixes.
4. Type the following command and press Enter:
nvme smart-log /dev/nvmeX, where nvmeX is the resource name of the NVMe Flash
adapter.
Check for problems with the critical warning, temperature, available spare, percentage
used, power cycles, or power on hours fields.
Note: For more information about nvme commands, type man nvme and press Enter.
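The field check in the last service action can be sketched in Python. This is a minimal sketch under assumptions, not an IBM procedure: it assumes that nvme smart-log prints one "name : value" pair per line with names such as critical_warning and percentage_used, the thresholds in review are illustrative defaults, and the SAMPLE text is hypothetical rather than captured from a real adapter.

```python
import re

# Field names as nvme-cli usually prints them (assumed spelling).
FIELDS = (
    "critical_warning",
    "temperature",
    "available_spare",
    "percentage_used",
    "power_cycles",
    "power_on_hours",
)


def parse_smart_log(text):
    """Return {field: raw value} for the fields named in the service action."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower().replace(" ", "_")
        if sep and key in FIELDS:
            values[key] = value.strip()
    return values


def leading_int(raw):
    """Read the leading number from values such as '38 C', '97%', or '1,234'."""
    match = re.match(r"0[xX][0-9a-fA-F]+|\d[\d,]*", raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    digits = match.group(0).replace(",", "")
    return int(digits, 16) if digits.lower().startswith("0x") else int(digits)


def review(values, max_percentage_used=90, min_available_spare=10):
    """List fields that deserve a closer look (thresholds are assumptions)."""
    findings = []
    if leading_int(values["critical_warning"]) != 0:
        findings.append("critical_warning")
    if leading_int(values["available_spare"]) < min_available_spare:
        findings.append("available_spare")
    if leading_int(values["percentage_used"]) > max_percentage_used:
        findings.append("percentage_used")
    return findings


# Hypothetical sample output, not captured from a real adapter.
SAMPLE = """\
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 38 C
available_spare                     : 100%
percentage_used                     : 97%
power_cycles                        : 12
power_on_hours                      : 1,234
"""

if __name__ == "__main__":
    print(review(parse_smart_log(SAMPLE)))
```

Temperature, power cycles, and power on hours are parsed but not judged, because acceptable ranges depend on the adapter; compare them with the adapter documentation.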
Resolving a storage device problem
Learn about the possible problems and service actions that you can perform to resolve a storage device
problem.
Note: To determine the location of the storage device, see “Identifying the location of the storage device”
on page 24.
Table 7. Storage device problems and service actions
Problem: System is unable to find a storage device that is at the front of the system
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the device
is seated and installed properly.
2. Verify that the device is compatible with your system.
3. Verify that all internal cables are properly seated and are not physically damaged.
4. Verify that the most recent firmware is installed on the system. If it is not, install the
most recent firmware.
5. Replace the drive.
6. If your system is an 8348-21C, replace the system backplane or the storage mezzanine card.
7. Replace the cable.
8. If you have a RAID adapter installed, replace it.
Problem: System is unable to find a storage device that is at the rear of the system (8348-21C only)
Service action:
If the system is unable to find one storage device that is at the rear of the system, replace
the following items, one at a time, until the problem is resolved:
v Drive
v Drive tray
v System backplane
If the system is unable to find more than one storage device that is at the rear of the
system, replace the following items, one at a time, until the problem is resolved:
v Drive tray
v System backplane
Problem: Drive stops working suddenly
Service action:
1. Verify that all internal cables are properly seated and are not physically damaged.
2. Check the system logs to verify whether the system detected a problem.
3. Replace the drive.
4. If your system is an 8348-21C, replace the system backplane or the storage mezzanine card.
5. Replace the cable.
6. If you have a RAID adapter installed, replace it.
Problem: Other problems
Service action:
Check the messages and resolve any other problems that were detected. Then, test the drive
again. If the drive still does not function, refer to the documentation for the drive.
Identifying the location of the PCIe adapter by using the slot number
The error message provides information to help you to determine the location of the PCIe adapter.
For example, the log might contain an error message similar to the following text:
Use the following table to map the slot number information in the operating system log to the PCIe
adapter description and service action.
Table 8. Slot numbers, adapter descriptions, and service action for the 8335-GCA or 8335-GTA
Slot information from the log: PCIe adapter description
Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3
Slot4: PCIe adapter 4
Slot5: PCIe adapter 5
Service action (all slots): Replace the PCIe adapter indicated in the PCIe adapter description
column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location
and the removal and replacement procedure.
Table 9. Slot numbers, adapter descriptions, and service action for the 8335-GTB
Slot information from the log: PCIe adapter description
Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3
Service action (all slots): Replace the PCIe adapter indicated in the PCIe adapter description
column. Go to “8335-GTB locations” on page 121 to identify the physical location and the
removal and replacement procedure.
Table 10. Slot numbers, adapter descriptions, and service action for the 8348-21C
Slot information from the log: PCIe adapter description
Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3
Slot4: PCIe adapter 4
Service action (all slots): Replace the PCIe adapter indicated in the PCIe adapter description
column. Go to “8348-21C locations” on page 133 to identify the physical location and the
removal and replacement procedure.
Identifying the location of the GPU
The error message provides information to help you to determine the location of the graphics processing
unit (GPU).
On an 8335-GCA or 8335-GTA system, the log might contain an error message similar to the following
text:
EEH: PHB#0 failure detected, location: Slot5
On an 8335-GTB system, the log might contain an error message similar to the following text:
EEH: PHB#0 failure detected, location: GPU1
If you have an 8335-GTB system with Red Hat Enterprise Linux 7.4 or later, and if you get an error
message with only PCI bus information (for example, 0002:01:00.0), you can determine the GPU slot
information by using the lshw command. Complete the following steps:
1. Record the PCI bus information that is in the error message.
2. Log in to the operating system with root authority.
3. Type the following command and press Enter:
lshw -class display
4. Determine the GPU slot that is associated with the PCI bus information that you recorded in step 1.
Use the following table to map the slot or GPU number information in the operating system log to the
GPU description and service action. This ends the procedure.
Table 11. Slot numbers, GPU descriptions, and service action for the 8335-GCA or 8335-GTA
Slot number information from the log: GPU description
Slot5: GPU 2
Slot2: GPU 1
Service action (all slots): Replace the GPU indicated in the GPU description column. Go to
“8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the
removal and replacement procedure.
Table 12. GPU numbers, GPU descriptions, and service action for the 8335-GTB
GPU number information from the log: GPU description
GPU1: GPU 1
GPU2: GPU 2
GPU3: GPU 3
GPU4: GPU 4
Service action (all GPUs): Replace the GPU indicated in the GPU description column. Go to
“8335-GTB locations” on page 121 to identify the physical location and the removal and
replacement procedure.
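The lookup in Tables 11 and 12 can be sketched in Python. This is a minimal sketch that assumes the log line matches the EEH example messages shown in this topic; the function name gpu_from_log and the dictionary layout are illustrative and are not part of any IBM tool.

```python
import re

# Location value from the log -> GPU description, copied from Tables 11 and 12.
GCA_GTA_GPUS = {"Slot5": "GPU 2", "Slot2": "GPU 1"}
GPU_BY_LOCATION = {
    "8335-GCA": GCA_GTA_GPUS,
    "8335-GTA": GCA_GTA_GPUS,
    "8335-GTB": {"GPU1": "GPU 1", "GPU2": "GPU 2", "GPU3": "GPU 3", "GPU4": "GPU 4"},
}

# Matches messages shaped like the examples in this topic.
EEH_PATTERN = re.compile(r"EEH: PHB#\d+ failure detected, location: (\S+)")


def gpu_from_log(model, line):
    """Return the GPU description for an EEH log line, or None if unknown."""
    match = EEH_PATTERN.search(line)
    if match is None:
        return None
    return GPU_BY_LOCATION.get(model, {}).get(match.group(1))


if __name__ == "__main__":
    print(gpu_from_log("8335-GTA", "EEH: PHB#0 failure detected, location: Slot5"))
    print(gpu_from_log("8335-GTB", "EEH: PHB#0 failure detected, location: GPU1"))
```

A result of None means that the message did not match or that the location is not listed for that model; in that case, fall back to the lshw steps described earlier in this topic.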
Identifying the location of the NVMe Flash adapter
Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.
1. Does the operating system log contain the slot number? For example, the log might contain an error
Yes: If your system is an 8335-GCA, use Table 13 on page 24 to map the slot number
information in the operating system log to the PCIe adapter description and service action.
If your system is an 8335-GTB, use Table 14 on page 24 to map the slot number information
in the operating system log to the PCIe adapter description and service action. This ends
the procedure.
No: Continue with the next step.
2. Locate the NVMe Flash adapter by using the PCI address:
a. The operating system log contains information about the NVMe Flash adapter in the form of a PCI
address. Record the PCI address information for the NVMe Flash adapter that has failed. For
example, in the operating system log message nvme 0006:01:00.0: Failed status: ffffffff,reset controller, the PCI address of the failing NVMe Flash adapter is 0006:01:00.0.
b. At the command line, type lscfg -vl pciaddress, where pciaddress is the NVMe Flash adapter
information that you recorded in step 2.a. Then, press Enter.
c. Record the slot number information that is in the location code field.
d. If your system is an 8335-GCA, use Table 13 on page 24 to map the slot number information to the
PCIe adapter description and service action. If your system is an 8335-GTB, use Table 14 on page
24 to map the slot number information to the PCIe adapter description and service action. This
ends the procedure.
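Steps 2.a and 2.b can be sketched in Python. This is a minimal sketch that assumes the log line has the same shape as the example message in step 2.a; the function names are illustrative. The command is only built as an argument list, not run.

```python
import re

# "nvme <domain>:<bus>:<device>.<function>:" as in the example log message.
NVME_ADDRESS = re.compile(
    r"\bnvme ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):"
)


def nvme_pci_address(line):
    """Return the PCI address from an nvme log message, or None."""
    match = NVME_ADDRESS.search(line)
    return match.group(1) if match else None


def lscfg_command(pci_address):
    """Build the command from step 2.b as an argument list."""
    return ["lscfg", "-vl", pci_address]


if __name__ == "__main__":
    message = "nvme 0006:01:00.0: Failed status: ffffffff,reset controller"
    address = nvme_pci_address(message)
    print(address)
    print(" ".join(lscfg_command(address)))
```

The slot number still has to be read from the location code field of the command output, as described in step 2.c.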
Table 13. Slot numbers, adapter descriptions, and service action for the 8335-GCA
Slot information from the log: PCIe adapter description
Slot1: PCIe adapter 1
Slot3: PCIe adapter 3
Slot4: PCIe adapter 4
Service action (all slots): Replace the NVMe Flash adapter indicated in the PCIe adapter
description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the
physical location and the removal and replacement procedure.
Table 14. Slot numbers, adapter descriptions, and service action for the 8335-GTB
Slot information from the log: PCIe adapter description
Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3
Service action (all slots): Replace the NVMe Flash adapter indicated in the PCIe adapter
description column. Go to “8335-GTB locations” on page 121 to identify the physical location
and the removal and replacement procedure.
Identifying the location of the storage device
Use this procedure to identify the location of a storage device.
1. Is there a disk drive or solid-state drive with an amber fault LED turned on solid?
Yes: Continue with step 2.
No: Continue with step 3.
2. Replace the disk drive or solid-state drive.
v If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111
to identify the removal and replacement procedure. This ends the procedure.
v If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the removal and
replacement procedure. This ends the procedure.
v If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the removal and
replacement procedure. This ends the procedure.
3. Is the system an 8335-GCA, 8335-GTA, or 8335-GTB?
Yes: Continue with step 4.
No: Continue with step 5.
4. The storage device location is determined in the drive removal and replacement procedures for your
system. Use the following table to find the correct removal and replacement procedure. This ends the
procedure.
Table 15. Drive removal and replacement procedures
System: Drive removal and replacement procedures
8335-GCA or 8335-GTA: See Removing and replacing a disk drive in the 8335-GCA or 8335-GTA
with the system power turned on.
8335-GTB: See Removing and replacing a disk drive in the 8335-GTB.
5. The system is an 8348-21C. Are the devices controlled by a RAID adapter?
Yes: Continue with step 6.
No: Continue with step 9.
6. To locate the device by using the identify LED, complete the following steps:
a. The operating system log contains information about the device in the form sdx, where x is the
letter associated with the drive that failed. Record the sdx information for the device that failed.
For example, the failing device in the following operating system log is sdb:
[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072
b. At the command prompt, type hdparm -i /dev/sdx, where sdx is the device information recorded
in step 6a. Then, press Enter.
c. Record the serial number of the device.
d. At the command prompt, type arcconf getconfig 1 PD and press Enter. Find the reported channel
and device numbers for the device that has the same serial number that you recorded in the
previous step. Record the reported channel and device numbers.
e. At the command prompt, type arcconf identify 1 device x y start, where x is the reported
channel number and y is the reported device number that you recorded in the previous step.
Then, press Enter.
Is the identify LED for one of the devices flashing?
Yes: Continue with the next step.
No: Continue with step 9.
7. Replace the device with the flashing identify LED. Go to “8348-21C locations” on page 133 to identify
the removal and replacement procedure. After you have replaced the device, continue with the next
step.
8. At the command prompt, type arcconf identify 1 device x y stop, where x is the reported channel
number and y is the reported device number that you recorded in step 6d. Then, press Enter. This
ends the procedure.
9. To locate the device by using the device serial number, complete the following steps:
a. The operating system log contains information about the device in the form sdx, where x is the
letter associated with the drive that failed. Record the sdx information for the device that failed.
For example, the failing device in the following operating system log is sdb:
[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072
b. At the command prompt, type hdparm -i /dev/sdx, where sdx is the device information recorded
in step 9a. Then, press Enter.
c. Record the serial number of the device.
d. Power off the system. Remove one device at a time until you identify the device with the serial
number identified in step 9c. Replace only the device with the matching serial number. Reinstall
the other devices. Go to “8348-21C locations” on page 133 to identify the removal and replacement
procedure. This ends the procedure.
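The command sequence in steps 6 through 9 can be sketched in Python. This is a minimal sketch that assumes the log line has the same shape as the blk_update_request example in this procedure; the function names are illustrative. The commands are only built as argument lists, and the serial number, channel number, and device number still have to be read from the hdparm and arcconf output by hand.

```python
import re

# Matches the I/O error message shown in steps 6a and 9a.
BLOCK_ERROR = re.compile(r"blk_update_request: I/O error, dev (sd[a-z]+),")


def failing_device(log_line):
    """Return the sdx name from a block I/O error message, or None."""
    match = BLOCK_ERROR.search(log_line)
    return match.group(1) if match else None


def hdparm_command(device):
    """Build the command from steps 6b and 9b that reports the serial number."""
    return ["hdparm", "-i", f"/dev/{device}"]


def identify_command(channel, device_number, action):
    """Build the arcconf identify command from step 6e (start) or step 8 (stop)."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["arcconf", "identify", "1", "device",
            str(channel), str(device_number), action]


if __name__ == "__main__":
    line = "[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072"
    device = failing_device(line)
    print(" ".join(hdparm_command(device)))
    # Channel 0 and device 3 are hypothetical values for illustration only.
    print(" ".join(identify_command(0, 3, "start")))
    print(" ".join(identify_command(0, 3, "stop")))
```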
User guides for GPUs and PCIe adapters
Use this information to find the user guide for your graphics processing unit (GPU) or PCIe adapter.
Use the following table to find the user guide for the GPU or PCIe adapter that you are using.
Resolving an over temperature problem for a water-cooled 8335-GTB system
Learn how to identify the service action that is needed to resolve an over temperature problem.
1. Go to Water cooling system specification and requirements. Are all of the requirements for
water-cooled systems met?
Note: For information specific to the 8335-GTB, see Model 8335-GTB water cooling option (Feature
code E2RD).
Yes: Continue with the next step.
No: Work with the customer to ensure that all of the requirements for water-cooled systems are
met. This ends the procedure.
2. Is the room temperature less than 40°C (104°F)?
Yes: Continue with the next step.
No: Notify the customer. The customer must bring the room temperature within normal range.
Continue with the next step.
3. Ensure that the following requirements are met:
a. The quick-connects between the 8335-GTB system and the water manifold are mated and
connected to the proper circuits of the manifold. The supply hose must be connected to the supply
manifold circuit, which is the manifold circuit that is located toward the inside of the rack. The
return hose must be connected to the return manifold circuit, which is the manifold circuit that is
located toward the outside of the rack.
b. The facility water supply hose is properly connected to the supply hose on the manifold and the
return hose on the manifold is properly connected to the facility water return hose.
v The ball valves that connect the facility water supply hose to the manifold supply hose and the
facility water return hose to the manifold return hose are open. For more information about
connecting the facility water hoses to the manifold hoses, see Replacing the water manifold in
the 8335-GTB.
v All of the valves that might restrict the flow of water through the hoses are open in the facility
water system.
v The pumping unit of the facility water system is on and does not have errors.
c. The facility water system is supplying water at the required temperature and flow. For
instructions, see Model 8335-GTB water cooling option (Feature code E2RD).
Does the problem persist?
Yes: Continue with the next step.
Note: Steps 1 - 3 resolve most problems. Ensure that you carefully check steps 1 - 3 before
you continue with the next step.
No: This ends the procedure.
4. Is a processor over heating, but the other processor and the graphics processing units (GPUs) are not
over heating?
Yes: Check the thermal interface material (TIM) between the cold plate and the processor that is
over heating. Go to Removing a system processor module from a water-cooled 8335-GTB
system and complete the steps to lift the cold plate off the processor. If the TIM pad is
damaged, replace the TIM pad. To replace a TIM pad, go to Replacing a system processor
module in a water-cooled 8335-GTB system and complete the steps for removing and
installing a new TIM pad. This ends the procedure.
No: Continue with the next step.
5. Is a GPU over heating, but the other GPUs and the processors are not over heating?
Yes: Replace the thermal interface material (TIM) between the cold plate and the GPU that is
over heating. Go to Removing the graphics processing unit from a water-cooled 8335-GTB
system and complete the steps to lift the cold plate off the GPU. Then, go to Replacing the
graphics processing unit in a water-cooled 8335-GTB system and complete the steps for
installing a new TIM pad. If the problem is not resolved, replace the GPU. For instructions
about replacing a GPU, see Removing and replacing a graphics processing unit in the
8335-GTB. This ends the procedure.
No: Continue with the next step.
6. Replace the cold plates. For instructions about how to replace the cold plates, see Removing and
replacing the cold plates in the 8335-GTB. Does the problem persist?
Yes: Go to “Contacting IBM service and support” on page 110. This ends the procedure.
No: This ends the procedure.
Identifying a service action
Use the following procedures to help you identify the service action that is needed.
Identifying a service action by using system event logs
Use the Intelligent Platform Management Interface (IPMI) program to examine system event logs (SELs)
to identify a service action.
1. Use the ipmitool command to examine SELs.
v To list SELs by using an in-band network, use the following command:
ipmitool sel elist
v To list SELs remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
2. Scan the SELs for an event with the value OEM record de. Did you find a SEL with the value OEM
record de?
Yes: Continue with the next step.
No: Go to step 4 on page 29.
3. The OEM record de specific log information is indicated by the rightmost digits of the SEL with the
value OEM record de. Use Table 17 to determine the service action to perform.
Table 17. OEM record de specific log information and service action
OEM record de specific log information: Service action
00xxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of
firmware that is available. If this SEL event continues to be logged, go to “Collecting
diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
01xxxxxxxxxx: Go to the “EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure” on page 96.
04xxxxxxxxxx: Go to the “EPUB_PRC_SP_CODE isolation procedure” on page 97.
05xxxxxxxxxx: Go to the “EPUB_PRC_PHYP_CODE isolation procedure” on page 97.
08xxxxxxxxxx: Go to the “EPUB_PRC_ALL_PROCS isolation procedure” on page 98.
09xxxxxxxxxx: Go to the “EPUB_PRC_ALL_MEMCRDS isolation procedure” on page 98.
0Axxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of
firmware that is available. If this SEL event continues to be logged, go to “Collecting
diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
10xxxxxxxxxx: Go to the “EPUB_PRC_LVL_SUPPORT isolation procedure” on page 99.
16xxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of
firmware that is available. If this SEL event continues to be logged, go to “Collecting
diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
1Cxxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of
firmware that is available. If this SEL event continues to be logged, go to “Collecting
diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
22xxxxxxxxxx: Go to the “EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure” on page 100.
2Dxxxxxxxxxx: Go to the “EPUB_PRC_FSI_PATH isolation procedure” on page 100.
30xxxxxxxxxx: Go to the “EPUB_PRC_PROC_AB_BUS isolation procedure” on page 101.
31xxxxxxxxxx: Go to the “EPUB_PRC_PROC_XYZ_BUS isolation procedure” on page 101.
34xxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of
firmware that is available. If this SEL event continues to be logged, go to “Collecting
diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
37xxxxxxxxxx: Go to the “EPUB_PRC_EIBUS_ERROR isolation procedure” on page 102.
3Fxxxxxxxxxx: Go to the “EPUB_PRC_POWER_ERROR isolation procedure” on page 103.
4Dxxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of
firmware that is available. If this SEL event continues to be logged, go to “Collecting
diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
4Fxxxxxxxxxx: Go to the “EPUB_PRC_MEMORY_UE isolation procedure” on page 104.
55xxxxxxxxxx: Go to the “EPUB_PRC_HB_CODE isolation procedure” on page 104.
56xxxxxxxxxx: Go to the “EPUB_PRC_TOD_CLOCK_ERR isolation procedure” on page 106.
5Cxxxxxxxxxx: Go to the “EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure” on page 106.
5Exxxxxxxxxx: Go to the “EPUB_PRC_GPU_ISOLATION_PROCEDURE isolation procedure” on page 107.
This ends the procedure.
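The Table 17 lookup can be sketched in Python. This is a minimal sketch that assumes the OEM record de specific log information is passed in as the string of hexadecimal digits read from the right side of the SEL entry; only the first two digits select the row. The function name and the FIRMWARE_UPDATE label are illustrative.

```python
# Rows of Table 17 keyed by the first two hexadecimal digits.
FIRMWARE_UPDATE = "Update the system firmware; if the event recurs, collect data and contact IBM"
SERVICE_ACTIONS = {
    "00": FIRMWARE_UPDATE,
    "01": "EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure",
    "04": "EPUB_PRC_SP_CODE isolation procedure",
    "05": "EPUB_PRC_PHYP_CODE isolation procedure",
    "08": "EPUB_PRC_ALL_PROCS isolation procedure",
    "09": "EPUB_PRC_ALL_MEMCRDS isolation procedure",
    "0A": FIRMWARE_UPDATE,
    "10": "EPUB_PRC_LVL_SUPPORT isolation procedure",
    "16": FIRMWARE_UPDATE,
    "1C": FIRMWARE_UPDATE,
    "22": "EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure",
    "2D": "EPUB_PRC_FSI_PATH isolation procedure",
    "30": "EPUB_PRC_PROC_AB_BUS isolation procedure",
    "31": "EPUB_PRC_PROC_XYZ_BUS isolation procedure",
    "34": FIRMWARE_UPDATE,
    "37": "EPUB_PRC_EIBUS_ERROR isolation procedure",
    "3F": "EPUB_PRC_POWER_ERROR isolation procedure",
    "4D": FIRMWARE_UPDATE,
    "4F": "EPUB_PRC_MEMORY_UE isolation procedure",
    "55": "EPUB_PRC_HB_CODE isolation procedure",
    "56": "EPUB_PRC_TOD_CLOCK_ERR isolation procedure",
    "5C": "EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure",
    "5E": "EPUB_PRC_GPU_ISOLATION_PROCEDURE isolation procedure",
}


def service_action(log_info):
    """Return the Table 17 service action for the given digits, or None."""
    return SERVICE_ACTIONS.get(log_info.strip()[:2].upper())


if __name__ == "__main__":
    # The trailing digits are hypothetical; only the first two are looked up.
    print(service_action("5c0000000000"))
```

A result of None means that the value is not listed in Table 17.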
4. Scan the SELs for an event with the value OEM record df. Did you find a SEL with the value OEM
record df?
Yes: Continue with the next step.
No: Go to step 10 on page 31.
5. One or more events might be logged around the same time as the event with the value OEM record
df. These events require a service action if they meet the following criteria:
v A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 36.
v Asserted is in the description.
v OEM record is not in the description.
v The event has a time stamp in close proximity to the time stamp of the event with the value OEM
record df.
6. Did you find any SEL events that require a service action as defined in step 5?
Yes: Continue with the next step.
No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and
support” on page 110.
7. Did you find only one SEL event that requires a service action as defined in step 5 on page 29?
Yes: Continue with the next step.
No: Go to step 9.
8. Record the SEL record ID for the event you identified in step 5 on page 29. The SEL record ID is
indicated by the leftmost digits of the SEL. Use the ipmitool command to display the SEL details.
v To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
v To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel
get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the
sensor name, sensor ID, and event description. Then, use the following information to determine the
service action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor
and event information for the 8335-GCA and 8335-GTA” on page 37.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78.
This ends the procedure.
9. You identified more than one event in step 5 on page 29. The service actions for all of the events that
were identified in step 5 on page 29 must be performed to successfully complete the repair. Record
the SEL record IDs for the events that you identified in step 5 on page 29. The SEL record ID is
indicated by the leftmost digits of the SEL. Use the ipmitool command to display SEL details for
each SEL record ID that you recorded.
v To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
v To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel
get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor
and event information for the 8335-GCA and 8335-GTA” on page 37.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78.
This ends the procedure.
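The selection criteria in step 5 and the record ID handling in steps 8 and 9 can be sketched in Python. This is a rough sketch under several assumptions: it assumes that each SEL line is a pipe-separated row whose first three fields are the record ID in hexadecimal digits, the date as MM/DD/YYYY, and the time as HH:MM:SS; the five-minute window is an illustrative reading of "close proximity"; and the sample lines and the keyword are hypothetical. Take the real keywords from “Identifying service action keywords in system event logs” on page 36.

```python
from datetime import datetime, timedelta


def parse_sel_line(line):
    """Split one SEL line into record ID, time stamp, and remaining fields."""
    fields = [field.strip() for field in line.split("|")]
    return {
        # ipmitool sel get expects a hexadecimal ID such as 0x1a.
        "record_id": "0x" + fields[0].lower(),
        "timestamp": datetime.strptime(f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S"),
        "details": fields[3:],
    }


def needs_service(event, df_event, keywords, window):
    """Apply the four criteria from step 5 to one event."""
    text = " | ".join(event["details"])
    return (
        any(keyword in text for keyword in keywords)
        and "Asserted" in event["details"]
        and "OEM record" not in text
        and abs(event["timestamp"] - df_event["timestamp"]) <= window
    )


def events_needing_service(lines, keywords, window=timedelta(minutes=5)):
    """Return the record IDs of events near an OEM record df event."""
    events = [parse_sel_line(line) for line in lines]
    df_events = [e for e in events
                 if any(d.startswith("OEM record df") for d in e["details"])]
    return [e["record_id"] for e in events
            if any(needs_service(e, df, keywords, window) for df in df_events)]


# Hypothetical SEL lines and keyword, for illustration only.
SAMPLE = [
    "  19 | 04/27/2017 | 10:15:20 | Fan #0x30 | Transition to OK | Asserted",
    "  1a | 04/27/2017 | 10:15:21 | Processor #0x12 | Configuration Error | Asserted",
    "  1b | 04/27/2017 | 10:15:22 | OEM record df | 040020 | 000000000000",
    "  1c | 04/27/2017 | 11:40:00 | Processor #0x13 | Configuration Error | Asserted",
]

if __name__ == "__main__":
    print(events_needing_service(SAMPLE, keywords=("Configuration Error",)))
```

Each returned ID can be passed to ipmitool sel get to display the sensor name, sensor ID, and event description.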
10. Scan the SEL for an event with the value OEM record c0.
11. Did you find an event with the value OEM record c0?
Yes: Continue with the next step.
No: Go to step 13 on page 35.
12. The OEM record c0 specific log information is indicated by the rightmost digits of the SEL with the
value OEM record c0. If your system is an 8335-GCA or 8335-GTA, use Table 18 to determine the
service action to perform. If your system is an 8335-GTB, use Table 19 on page 32 to determine the
service action to perform. If your system is an 8348-21C, use Table 20 on page 34 to determine the
service action to perform.
Table 18. OEM record c0 specific log information, description, and service action for an 8335-GCA or 8335-GTA
OEM record c0 specific log information: Description
320a01xxxxxx: Phy read failure
320a02xxxxxx: Phy speed and duplex failure
Service action: If you are viewing this event from the BMC, the missing or defective cable is
now operational and no service action is required. Otherwise, replace the missing or failed
LAN cable that attaches the console to the system.
320exxxxxxxx: OCC reset required
Service action: This event is for information only. No service action is required.
3a0400xxxxxx: Chassis soft power off
3a0402xxxxxx: Chassis soft reboot
Service action: A user initiated power off request occurred. No service action is required.
3a0701xxxxxx: Request for PNOR access
3a0702xxxxxx: Release of PNOR access
3a1100xxxxxx: Fan thread stopped
3a1101xxxxxx: Fan thread started
Service action: This event is for information only. No service action is required.
3a1503xxxxxx: Primary side boot failed
Service action: Go to “Resolving a system firmware boot failure” on page 4.
3a1504xxxxxx: Golden side boot failed
Service action: Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx: Fan 1 failure
Service action: Replace Fan 1. Go to “8335-GCA and 8335-GTA locations” on page 111 to
identify the physical location and removal and replacement procedure.
3a1602xxxxxx: Fan 2 failure
Service action: Replace Fan 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to
identify the physical location and removal and replacement procedure.
3a1603xxxxxx: Fan 3 failure
Service action: Replace Fan 3. Go to “8335-GCA and 8335-GTA locations” on page 111 to
identify the physical location and removal and replacement procedure.
3a1604xxxxxx: Fan 4 failure
Service action: Replace Fan 4. Go to “8335-GCA and 8335-GTA locations” on page 111 to
identify the physical location and removal and replacement procedure.
3a260xyyyyyy, where x = 1, 2, or 3: System shut down due to one or more missing or failed fans
Service action: The OEM record c0 specific log information is 3a260xyyyyyy, where x is the
number of fans that were missing or failed when the system was shut down. The system cannot
be powered on with missing fans. If any SEL events were logged with OEM record c0 specific
log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise,
replace the fans, one at a time, until the problem is resolved. Go to “8335-GCA and 8335-GTA
locations” on page 111 to identify the physical location and removal and replacement procedure.
3a2604yyyyyy: All of the fans are missing or failed
Service action: Ensure that the fan power cable and the disk and fan signal cable are seated
properly. If the problem persists, replace the following items, one at a time, until the
problem is resolved:
Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location
and removal and replacement procedure.
v Power riser with time-of-day battery slot
v Fan power cable
v Disk and fan signal cable
v Disk drive and fan card
Table 19. OEM record c0 specific log information, description, and service action for an 8335-GTB
OEM record c0 specific log
informationDescriptionService action
320a01xxxxxxPhy read failureIf you are viewing this event from
320a02xxxxxxPhy speed and duplex failure
the BMC, the missing or defective
cable is now operational and no
service action is required. Otherwise,
replace the missing or failed LAN
cable that attaches the console to the
system.
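OEM record c0 specific log information: 320exxxxxxxx
Description: OCC reset required
Service action: This event is for information only. No service action is required.
OEM record c0 specific log information: 3a0400xxxxxx
Description: Chassis soft power off
OEM record c0 specific log information: 3a0402xxxxxx
Description: Chassis soft reboot
Service action for 3a0400xxxxxx and 3a0402xxxxxx: A user initiated power off request occurred. No service action is required.
OEM record c0 specific log information: 3a0701xxxxxx
Description: Request for PNOR access
OEM record c0 specific log information: 3a0702xxxxxx
Description: Release of PNOR access
OEM record c0 specific log information: 3a1100xxxxxx
Description: Fan thread stopped
OEM record c0 specific log information: 3a1101xxxxxx
Description: Fan thread started
Service action for 3a0701xxxxxx, 3a0702xxxxxx, 3a1100xxxxxx, and 3a1101xxxxxx: This event is for information only. No service action is required.
OEM record c0 specific log information: 3a1503xxxxxx
Description: Primary side boot failed
Service action: Go to “Resolving a system firmware boot failure” on page 4.
OEM record c0 specific log information: 3a1504xxxxxx
Description: Golden side boot failed
Service action: Go to “Resolving a system firmware boot failure” on page 4.
OEM record c0 specific log information: 3a1601xxxxxx
Description: Fan 1 failure
Service action: Replace Fan 1. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a1602xxxxxx
Description: Fan 2 failure
Service action: Replace Fan 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a1603xxxxxx
Description: Fan 3 failure
Service action: Replace Fan 3. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a1604xxxxxx
Description: Fan 4 failure
Service action: Replace Fan 4. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.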
OEM record c0 specific log information: 3a2600xxxxxx
Description: The water-cooled system shut down due to too many processor core sensors reading a temperature at or above the maximum temperature that is allowed.
Service action: At least one processor is overheating. Go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26.
OEM record c0 specific log information: 3a260xyyyyyy, where x = 1, 2, or 3
Description: System shut down due to one or more missing or failed fans
Service action: The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a2604yyyyyy
Description: All of the fans are missing or failed
Service action: Ensure that the fan power cable and the disk and fan signal cable are seated properly. If the problem persists, replace the following items, one at a time, until the problem is resolved:
v Power riser with time-of-day battery slot
v Fan power cable
v Disk and fan signal cable
v Disk drive and fan card
Note: Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Table 20. OEM record c0 specific log information, description, and service action for an 8348-21C
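OEM record c0 specific log information: 320a01xxxxxx
Description: Phy read failure
OEM record c0 specific log information: 320a02xxxxxx
Description: Phy speed and duplex failure
Service action for 320a01xxxxxx and 320a02xxxxxx: If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
OEM record c0 specific log information: 320exxxxxxxx
Description: OCC reset required
Service action: This event is for information only. No service action is required.
OEM record c0 specific log information: 3a0400xxxxxx
Description: Chassis soft power off
OEM record c0 specific log information: 3a0402xxxxxx
Description: Chassis soft reboot
Service action for 3a0400xxxxxx and 3a0402xxxxxx: A user initiated power off request occurred. No service action is required.
OEM record c0 specific log information: 3a0701xxxxxx
Description: Request for PNOR access
OEM record c0 specific log information: 3a0702xxxxxx
Description: Release of PNOR access
OEM record c0 specific log information: 3a1100xxxxxx
Description: Fan thread stopped
OEM record c0 specific log information: 3a1101xxxxxx
Description: Fan thread started
Service action for 3a0701xxxxxx, 3a0702xxxxxx, 3a1100xxxxxx, and 3a1101xxxxxx: This event is for information only. No service action is required.
OEM record c0 specific log information: 3a1503xxxxxx
Description: Primary side boot failed
Service action: Go to “Resolving a system firmware boot failure” on page 4.
OEM record c0 specific log information: 3a1504xxxxxx
Description: Golden side boot failed
Service action: Go to “Resolving a system firmware boot failure” on page 4.
OEM record c0 specific log information: 3a1601xxxxxx
Description: Fan 1 failure
Service action: Replace Fan 1. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a1602xxxxxx
Description: Fan 2 failure
Service action: Replace Fan 2. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.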
OEM record c0 specific log information: 3a1603xxxxxx
Description: Fan 3 failure
Service action: Replace Fan 3. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a1604xxxxxx
Description: Fan 4 failure
Service action: Replace Fan 4. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a1605xxxxxx
Description: Fan 5 failure
Service action: Replace Fan 5. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a260xyyyyyy, where x = 1, 2, 3, or 4
Description: System shut down due to one or more missing or failed fans
Service action: The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing or failed fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
OEM record c0 specific log information: 3a2605yyyyyy
Description: All of the fans are missing or failed
Service action: Replace the disk drive backplane. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
13. One or more SEL events might require a service action. These events require a service action if they
meet the following criteria:
v A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 36.
v Asserted is in the description.
v OEM record is not in the description.
14. Did you find one or more SEL events that require a service action as defined in step 13?
v If yes, continue with the next step.
v If no, this ends the procedure.
15. The service actions for all of the events that were identified in step 13 must be performed to
successfully complete the repair. Record the SEL record IDs for the events that you identified in step
13. The SEL record ID is indicated by the leftmost digits of the SEL. Use the ipmitool command to
display SEL details for each SEL record ID that you recorded.
v To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
v To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the
sensor name, sensor ID, and event description. Then, use this information to determine the service
action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor
and event information for the 8335-GCA and 8335-GTA” on page 37.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78.
This ends the procedure.
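The two command forms in step 15 can be assembled programmatically. The following Python sketch is illustrative only: the function names and the sample SEL line are hypothetical, and the sketch assumes that the leftmost field of a SEL line is a hexadecimal record ID followed by a | separator. It builds the ipmitool argument list and does not run the command.

```python
def record_id_from_sel_line(line):
    """Return the SEL record ID from the leftmost field of a SEL line.

    Assumption: fields are separated by '|' and the ID is hexadecimal.
    """
    return int(line.split("|", 1)[0].strip(), 16)


def sel_get_command(record_id, host=None, username=None, password=None):
    """Build the ipmitool argument list that displays one SEL record."""
    rid = format(record_id, "#x")  # hexadecimal form, for example 0x1a
    if host is None:
        # In-band form from step 15.
        return ["ipmitool", "sel", "get", rid]
    # Remote form over the LAN from step 15.
    return ["ipmitool", "-I", "lanplus", "-U", username, "-P", password,
            "-H", host, "sel", "get", rid]


# Hypothetical SEL line, for illustration only.
sample = "  1a | 04/20/2016 | 10:15:02 | Fan | Transition to Non-recoverable | Asserted"
print(" ".join(sel_get_command(record_id_from_sel_line(sample))))
# prints: ipmitool sel get 0x1a
```

Returning an argument list rather than a single string keeps the user name, password, and host name as separate arguments if the list is later passed to a process launcher.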
Identifying service action keywords in system event logs
System event logs (SELs) that have Asserted and any of the keywords indicated below in the description
require a service action.
Temperature, voltage, and current service action keywords
v Transition to Critical from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
Fan service action keywords
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Device Removed / Device Absent
v Transition to degraded
v Install error
v Redundancy lost
v Non-redundant insufficient resources
Memory service action keywords
v Configuration Error
v Transition to Non-recoverable
v Predictive Failure
Processor service action keywords
v IERR
v Transition to Non-recoverable
v Predictive Failure
Power supply and All PGood service action keywords
v Power Supply Failure Detected
v Predictive Failure
v Power Supply Input Lost or AC DC
v Power Supply Input Lost Or Out of Range
v Power Supply Input Out of Range But Present
v Configuration Error
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
v Redundancy lost
v Non-redundant insufficient resources
v AC Lost
v Soft Power Control Failure
v Power Unit Failure Detected
v Predictive Failure
System firmware service action keywords
v System Firmware Error
v System Firmware Hang
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
System ACPI power state service action keywords
v Unknown
Watchdog service action keywords
v Hard Reset
v Power Down
v Power Cycle
v Timer Interrupt
System event service action keywords
v Undetermined system hardware failure
OS boot service action keywords
v Installation aborted
v Installation failed
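The keyword lists above, together with the three criteria for identifying SEL events that require a service action (a keyword is present, Asserted is in the description, OEM record is not in the description), can be expressed as a small check. The following Python sketch is an illustration, not part of any IBM tool. It includes only some of the keyword categories, and it assumes that a SEL description is a string of fields separated by | with the event text in a field of its own. The function name and the sample descriptions are hypothetical.

```python
# Keywords copied from the lists above. Only some categories are shown;
# the remaining categories follow the same pattern.
SERVICE_ACTION_KEYWORDS = {
    "Fan": [
        "Transition to Critical from Less Severe",
        "Transition to Non-recoverable from Less Severe",
        "Transition to Critical from Non-recoverable",
        "Device Removed / Device Absent",
        "Transition to degraded",
        "Install error",
        "Redundancy lost",
        "Non-redundant insufficient resources",
    ],
    "Memory": ["Configuration Error", "Transition to Non-recoverable", "Predictive Failure"],
    "Processor": ["IERR", "Transition to Non-recoverable", "Predictive Failure"],
    "System ACPI power state": ["Unknown"],
    "Watchdog": ["Hard Reset", "Power Down", "Power Cycle", "Timer Interrupt"],
    "OS boot": ["Installation aborted", "Installation failed"],
}


def requires_service_action(category, description):
    """Apply the three criteria to one SEL description string."""
    if "OEM record" in description:
        return False
    # Assumption: fields are separated by '|'.
    fields = [field.strip() for field in description.split("|")]
    if "Asserted" not in fields:
        return False
    keywords = {keyword.lower() for keyword in SERVICE_ACTION_KEYWORDS[category]}
    # Whole-field comparison, so "Transition to Non-recoverable from Less Severe"
    # does not match the shorter keyword "Transition to Non-recoverable".
    return any(field.lower() in keywords for field in fields)


# Hypothetical descriptions, for illustration only.
print(requires_service_action("Processor", "CPU Func 1 | IERR | Asserted"))
print(requires_service_action("Processor", "CPU Func 1 | IERR | Deasserted"))
```

Comparing whole fields rather than searching for substrings is a design choice of this sketch. It keeps a longer event name from matching a shorter keyword that it happens to contain.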
Identifying a service action by using sensor and event information
You can use sensor and event information from the system event log (SEL) to determine a service action.
Identifying a service action by using sensor and event information for the
8335-GCA and 8335-GTA
You can use the sensor and event information from the system event log (SEL) to determine a service
action to perform for the IBM Power® System S822LC (8335-GCA and 8335-GTA).
If you have not done so already, complete “Identifying a service action by using system event logs” on
page 27. Then, use the following table to determine the service action to perform.
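Several rows of Table 21 name the part to replace according to the number in the sensor name (for example, DIMM Func 2 maps to DIMM 2, and CPU Core Func 13 through 24 map to CPU 2). The following Python sketch shows that mapping for the 8335-GCA and 8335-GTA only. The function name is hypothetical, the sketch covers only the numbered sensor families listed in the code, and the result applies only when the event description is one that the table pairs with a replacement service action.

```python
import re

# Matches a numbered sensor field such as "DIMM Func 7 (0x24)".
SENSOR_FIELD = re.compile(
    r"^(?P<family>.+?) (?P<number>\d+) \((?P<sensor_id>0x[0-9A-Fa-f]+)\)$"
)


def part_to_replace(sensor_field):
    """Return the part that Table 21 names for a numbered sensor, or None."""
    match = SENSOR_FIELD.match(sensor_field.strip())
    if match is None:
        return None
    family = match.group("family")
    number = int(match.group("number"))
    if family == "DIMM Func":
        return f"DIMM {number}"
    if family == "CPU Func":
        return f"CPU {number}"
    if family == "CPU Core Func":
        # Cores 1 through 12 belong to CPU 1; cores 13 through 24 to CPU 2.
        return "CPU 1" if number <= 12 else "CPU 2"
    if family == "Mem Buf Func":
        return f"memory riser {number}"
    if family == "GPU Func":
        # GPU Func 1 or 2 maps to GPU 1; GPU Func 3 or 4 maps to GPU 2.
        return "GPU 1" if number <= 2 else "GPU 2"
    return None


print(part_to_replace("CPU Core Func 13 (0x4A)"))
```

Sensors that are not in one of these families, such as Ambient Temp (0x0A), return None, and the table must be consulted directly.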
Table 21. Sensor information, event description, and service action for the 8335-GCA and 8335-GTA
Sensor name (Sensor ID): Watchdog (0x00)
Event description:
v Timer Expired
v Reserved1
v Reserved2
v Reserved3
v Reserved4
Service action: No service action is required.
Event description:
v Hard Reset
v Power Down
v Power Cycle
v Timer Interrupt
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.
Sensor name (Sensor ID): Host Status (0x04)
Event description: Unknown
Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
v S0/G0 “Working”
v S1 “Sleeping with system h/w & processor context maintained”
v S2 “sleeping, processor context lost”
v S3 “sleeping, processor & h/w context lost, memory retained”
v S4 “non-volatile sleep / suspend-to disk”
v S5 / G2: “soft-off”
v S4 / S5: “soft-off”
v G3 mechanical Off
v Sleeping in an S1/S2/S3 State
v G1: Sleeping
v S5: entered by override
v Legacy ON state
v Legacy OFF state
Service action: No service action is required.
Sensor name (Sensor ID): FW Boot Progress (0x05)
Event description:
v System Firmware Error
v System Firmware Hang
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.
Event description: System Firmware Progress
Service action: No service action is required.
Sensor name (Sensor ID):
v OCC 1 Active (0x08)
v OCC 2 Active (0x09)
Event description: Device Disabled
Service action: If the sensor name is OCC 1 Active, replace CPU 1. If the sensor name is OCC 2 Active, replace CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
v State Deasserted
v Device Enabled
Service action: No service action is required.
Sensor name (Sensor ID): Ambient Temp (0x0A)
Event description:
v Upper Critical - going low
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Lower Critical - going low
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
Service action: No service action is required.
Event description: Upper Critical - going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.
v CPU1 Temp (0x0B)
v CPU2 Temp (0x0D)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical - going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Lower Critical - going low
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
Sensor name (Sensor ID):
v CPU Func 1 (0x0C)
v CPU Func 2 (0x0E)
Event description:
v IERR
v Transition to Non-recoverable
v Predictive Failure
Service action: If the sensor name is CPU Func 1, replace CPU 1. If the sensor name is CPU Func 2, replace CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
v Thermal Trip
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU Complex Error
v Processor Disabled
v Terminator Presence Detected
v Processor Automatically Throttled
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from More Severe
v Monitor
v Informational
Service action: No service action is required.
All PGood (0x1C)
v Interlock Power Down
v Power Off Power Down
v Power Cycle
v 240VA Power Down
No service action is required.
v AC Lost
v Soft Power Control Failure
v Power Unit Failure Detected
v Predictive Failure
v Ensure that ac power is supplied
to the rack.
v Ensure that the system power
cords are plugged tightly into both
the power supply and the rack
power distribution unit (PDU) for
both system power supplies.
v Ensure that the system was not
powered off.
v Ensure that ac power is supplied
to the rack.
v Ensure that the power supply
cords are plugged tightly into the
power supplies and the rack PDU
unit.
v Ensure that the system was not
powered off.
v Check for service action required
SEL events for the power supply
sensor. If any exist, follow the
service action that is specified in
“Identifying a service action by
using sensor and event information
for the 8335-GCA and 8335-GTA”
on page 37.
v DIMM Func 1 (0x1E)
v DIMM Func 2 (0x1F)
v DIMM Func 3 (0x20)
v DIMM Func 4 (0x21)
v DIMM Func 5 (0x22)
v DIMM Func 6 (0x23)
v DIMM Func 7 (0x24)
v DIMM Func 8 (0x25)
v DIMM Func 9 (0x26)
v DIMM Func 10 (0x27)
v DIMM Func 11 (0x28)
v DIMM Func 12 (0x29)
v DIMM Func 13 (0x2A)
v DIMM Func 14 (0x2B)
v DIMM Func 15 (0x2C)
v DIMM Func 16 (0x2D)
v DIMM Func 17 (0x2E)
v DIMM Func 18 (0x2F)
v DIMM Func 19 (0x30)
v DIMM Func 20 (0x31)
v DIMM Func 21 (0x32)
v DIMM Func 22 (0x33)
v DIMM Func 23 (0x34)
v DIMM Func 24 (0x35)
v DIMM Func 25 (0x36)
v DIMM Func 26 (0x37)
v DIMM Func 27 (0x38)
v DIMM Func 28 (0x39)
v DIMM Func 29 (0x3A)
v DIMM Func 30 (0x3B)
v DIMM Func 31 (0x3C)
v DIMM Func 32 (0x3D)
Event description:
v Memory Device Disabled
v Uncorrectable Memory Error
v Memory Scrub Failed
v State Deasserted
v Device Disabled
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Correctable Memory Error
v Parity
v Correctable Memory Error Logging Limit Reached
v Memory Automatically Throttled
v Critical Over temperature
v Presence Detected
v Spare
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from More Severe
v Monitor
v Informational
Service action: No service action is required.
Event description:
v Transition to Non-recoverable
v Predictive Failure
Service action: If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
v DIMM Func 1 (0x1E)
v DIMM Func 2 (0x1F)
v DIMM Func 3 (0x20)
v DIMM Func 4 (0x21)
v DIMM Func 5 (0x22)
v DIMM Func 6 (0x23)
v DIMM Func 7 (0x24)
v DIMM Func 8 (0x25)
v DIMM Func 9 (0x26)
v DIMM Func 10 (0x27)
v DIMM Func 11 (0x28)
v DIMM Func 12 (0x29)
v DIMM Func 13 (0x2A)
v DIMM Func 14 (0x2B)
v DIMM Func 15 (0x2C)
v DIMM Func 16 (0x2D)
v DIMM Func 17 (0x2E)
v DIMM Func 18 (0x2F)
v DIMM Func 19 (0x30)
v DIMM Func 20 (0x31)
v DIMM Func 21 (0x32)
v DIMM Func 22 (0x33)
v DIMM Func 23 (0x34)
v DIMM Func 24 (0x35)
v DIMM Func 25 (0x36)
v DIMM Func 26 (0x37)
v DIMM Func 27 (0x38)
v DIMM Func 28 (0x39)
v DIMM Func 29 (0x3A)
v DIMM Func 30 (0x3B)
v DIMM Func 31 (0x3C)
v DIMM Func 32 (0x3D)
Event description: Configuration Error
Service action: Complete the following steps:
1. If the sensor name is DIMM Func 1, ensure that DIMM 1 is seated properly. If the sensor name is DIMM Func 2, ensure that DIMM 2 is seated properly. And so on.
2. If you recently installed or replaced memory DIMMs, ensure that the DIMMs are plugged in the correct memory slots.
3. If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
v CPU Core Func 1 (0x3E)
v CPU Core Func 2 (0x3F)
v CPU Core Func 3 (0x40)
v CPU Core Func 4 (0x41)
v CPU Core Func 5 (0x42)
v CPU Core Func 6 (0x43)
v CPU Core Func 7 (0x44)
v CPU Core Func 8 (0x45)
v CPU Core Func 9 (0x46)
v CPU Core Func 10 (0x47)
v CPU Core Func 11 (0x48)
v CPU Core Func 12 (0x49)
Event description:
v IERR
v Transition to Non-recoverable
v Predictive Failure
Service action: Replace system processor CPU 1. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU Complex Error
v Processor Disabled
v Terminator Presence Detected
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Thermal Trip
v Processor Automatically Throttled
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from More Severe
v Monitor
v Informational
Service action: No service action is required.
v CPU Core Func 13 (0x4A)
v CPU Core Func 14 (0x4B)
v CPU Core Func 15 (0x4C)
v CPU Core Func 16 (0x4D)
v CPU Core Func 17 (0x4E)
v CPU Core Func 18 (0x4F)
v CPU Core Func 19 (0x50)
v CPU Core Func 20 (0x51)
v CPU Core Func 21 (0x52)
v CPU Core Func 22 (0x53)
v CPU Core Func 23 (0x54)
v CPU Core Func 24 (0x55)
Event description:
v IERR
v Transition to Non-recoverable
v Predictive Failure
Service action: Replace system processor CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU Complex Error
v Processor Disabled
v Terminator Presence Detected
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Thermal Trip
v Processor Automatically Throttled
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from More Severe
v Monitor
v Informational
Service action: No service action is required.
v Mem Buf Func 1 (0x56)
v Mem Buf Func 2 (0x57)
v Mem Buf Func 3 (0x58)
v Mem Buf Func 4 (0x59)
v Mem Buf Func 5 (0x5A)
v Mem Buf Func 6 (0x5B)
v Mem Buf Func 7 (0x5C)
v Mem Buf Func 8 (0x5D)
Event description:
v Uncorrectable Memory Error
v Memory Device Disabled
v State Deasserted
v Device Disabled
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Correctable Memory Error
v Parity
v Memory Scrub Failed
v Correctable Memory Error Logging Limit Reached
v Memory Automatically Throttled
v Critical Over temperature
v Presence Detected
v Spare
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from More Severe
v Monitor
v Informational
Service action: No service action is required.
Event description:
v Configuration Error
v Transition to Non-recoverable
v Predictive Failure
Service action: If the sensor name is Mem Buf Func 1, replace memory riser 1. If the sensor name is Mem Buf Func 2, replace memory riser 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): Boot Count (0x5F)
Event description: None
Service action: No service action is required.
Sensor name (Sensor ID): Motherboard Flt (0x60)
Event description: State Deasserted
Service action: No service action is required.
Event description: State Asserted
Service action: Replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): System Event (0x61)
Event description: Undetermined system hardware failure
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
v System Reconfigured
v OEM System boot event
v Entry added to auxiliary log
v PEF Action
v Timestamp Clock Sync
v Transition State Active
v Transition State Idle
v Transition State Busy
Service action: No service action is required.
Sensor name (Sensor ID): Activate Pwr Lt (0x62)
Event description: None
Service action: No service action is required.
Sensor name (Sensor ID):
v Ref Clock Fault (0x63)
v PCI Clock Fault (0x64)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.
v DIMM1 Temp (0x69)
v DIMM2 Temp (0x6A)
v DIMM3 Temp (0x6B)
v DIMM4 Temp (0x6C)
v DIMM5 Temp (0x6D)
v DIMM6 Temp (0x6E)
v DIMM7 Temp (0x6F)
v DIMM8 Temp (0x70)
v DIMM9 Temp (0x71)
v DIMM10 Temp (0x72)
v DIMM11 Temp (0x73)
v DIMM12 Temp (0x74)
v DIMM13 Temp (0x75)
v DIMM14 Temp (0x76)
v DIMM15 Temp (0x77)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
v DIMM16 Temp (0x78)
v DIMM17 Temp (0x79)
v DIMM18 Temp (0x7A)
v DIMM19 Temp (0x7B)
v DIMM20 Temp (0x7C)
v DIMM21 Temp (0x7D)
v DIMM22 Temp (0x7E)
v DIMM23 Temp (0x7F)
v DIMM24 Temp (0x80)
v DIMM25 Temp (0x81)
v DIMM26 Temp (0x82)
v DIMM27 Temp (0x83)
v DIMM28 Temp (0x84)
v DIMM29 Temp (0x85)
v DIMM30 Temp (0x86)
v DIMM31 Temp (0x87)
v DIMM32 Temp (0x88)
v CPU Core Temp 1 (0x89)
v CPU Core Temp 2 (0x8A)
v CPU Core Temp 3 (0x8B)
v CPU Core Temp 4 (0x8C)
v CPU Core Temp 5 (0x8D)
v CPU Core Temp 6 (0x8E)
v CPU Core Temp 7 (0x8F)
v CPU Core Temp 8 (0x90)
v CPU Core Temp 9 (0x91)
v CPU Core Temp 10 (0x92)
v CPU Core Temp 11 (0x93)
v CPU Core Temp 12 (0x94)
v CPU Core Temp 13 (0x95)
v CPU Core Temp 14 (0x96)
v CPU Core Temp 15 (0x97)
v CPU Core Temp 16 (0x98)
v CPU Core Temp 17 (0x99)
v CPU Core Temp 18 (0x9A)
v CPU Core Temp 19 (0x9B)
v CPU Core Temp 20 (0x9C)
v CPU Core Temp 21 (0x9D)
v CPU Core Temp 22 (0x9E)
v CPU Core Temp 23 (0x9F)
v CPU Core Temp 24 (0xA0)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
No service action is required.
v 12V Sense (0xA1)
v Proc0 Power (0xA2)
v Proc1 Power (0xA3)
v PCIE Proc0 Pwr (0xA6)
v PCIE Proc1 Pwr (0xA7)
v GPU Sense (0xAA)
v Mem Cache Power (0xAB)
v Mem Proc0 Pwr (0xAC)
v Mem Proc1 Pwr (0xAD)
v Fan Power A (0xB0)
Event description:
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
Service action: No service action is required.
Sensor name (Sensor ID):
v TOD Clock Fault (0xB1)
v APSS Fault (0xB2)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.
Sensor name (Sensor ID): PS Derating Factor (0xB4)
Event description: None
Service action: No service action is required.
Sensor name (Sensor ID): OS Boot (0xB5)
Event description:
v Installation aborted
v Installation failed
Service action: Ensure that the operating system boot image is loaded. Ensure that the disk drive or solid-state drive is ready. Reload the operating system boot image.
Event description:
v A: boot completed
v C: boot completed
v PXE boot completed
v Diagnostic boot completed
v CD-ROM boot completed
v ROM boot completed
v Boot completed - device not specified
v Installation started
v Installation completed
Service action: No service action is required.
Sensor name (Sensor ID): PCI (0xB6)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.
v GPU Func 1 (0xB8)
v GPU Func 2 (0xB9)
v GPU Func 3 (0xBA)
v GPU Func 4 (0xBB)
v GPU Temp 1 (0xBC)
v GPU Temp 2 (0xBD)
v GPU Temp 3 (0xBE)
v GPU Temp 4 (0xBF)
v Mem Buf Temp 1 (0xC0)
v Mem Buf Temp 2 (0xC1)
v Mem Buf Temp 3 (0xC2)
v Mem Buf Temp 4 (0xC3)
v Mem Buf Temp 5 (0xC4)
v Mem Buf Temp 6 (0xC5)
v Mem Buf Temp 7 (0xC6)
v Mem Buf Temp 8 (0xC7)
v Uncorrectable Memory Error
v Parity
v Memory Scrub Failed
v Memory Device Disabled
v Configuration Error
v Memory Automatically Throttled
v Correctable Memory Error
v Parity
v Correctable Memory Error Logging
Limit Reached
v Presence Detected
v Spare
v Critical Over temperature
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
If the sensor name is GPU Func 1 or
GPU Func 2, replace GPU 1. If the
sensor name is GPU Func 3 or GPU
Func 4, replace GPU 2. Go to
“8335-GCA and 8335-GTA locations”
on page 111 to identify the physical
location and removal and
replacement procedure.
No service action is required.
No service action is required.
No service action is required.
v CPU Diode 1 (0xC8)
v CPU Diode 2 (0xCB)
v Lower Non-critical – going low
v Lower Non-critical – going high
No service action is required.
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
Checkstop (0xC9)
IERR
If this event immediately precedes a
system power off, no service action is
required. Otherwise, search for SEL
events that meet the following
criteria:
v The event has a time stamp in
close proximity to the time stamp
of this event.
v A service action keyword is
present. For a list of service action
keywords, see “Identifying service
action keywords in system event
logs” on page 36.
v Asserted is in the description.
If you found a SEL event that
matches the criteria, perform the
service action that is indicated in this
table for the SEL event. Otherwise, go
to “Collecting diagnostic data” on
page 109. Then, go to “Contacting
IBM service and support” on page
110.
v Thermal Trip
v Configuration Error
v Processor Automatically Throttled
v Correctable Machine Check Error
v Processor Presence Detected
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup
Initialization Failure
v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled
v Terminator Presence Detected
v Machine Check Exception
No service action is required.
Go to “Collecting diagnostic data” on
page 109. Then, go to “Contacting
IBM service and support” on page
110.
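The three search criteria for a Checkstop IERR event can be sketched as a small filter. This is an illustration only: the SelEvent record, the description format, and the 60-second window are assumptions (the table says only that the time stamps must be close), and the keyword list must come from “Identifying service action keywords in system event logs”.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class SelEvent:
    """Hypothetical, simplified view of one SEL entry."""
    timestamp: datetime
    description: str


def find_related_events(
    checkstop: SelEvent,
    events: Iterable[SelEvent],
    keywords: Iterable[str],
    window: timedelta = timedelta(seconds=60),  # assumed meaning of "close"
) -> List[SelEvent]:
    """Return the SEL events that meet all three Checkstop criteria."""
    lowered = [k.lower() for k in keywords]
    matches = []
    for event in events:
        if event is checkstop:
            continue  # skip the Checkstop event itself
        text = event.description.lower()
        close_in_time = abs(event.timestamp - checkstop.timestamp) <= window
        has_keyword = any(k in text for k in lowered)
        # The word boundary keeps "Deasserted" from matching "Asserted".
        is_asserted = re.search(r"\basserted\b", text) is not None
        if close_in_time and has_keyword and is_asserted:
            matches.append(event)
    return matches
```

If the returned list is empty, the table directs you to collect diagnostic data and contact IBM service and support.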
v PSU Fault 1 (0xCD)
v PSU Fault 2 (0xCE)
Power Supply Failure Detected
An assert event immediately
followed by a deassert event
indicates that a power cycle of the
system occurred. No service action is
required. If there is no deassert event
immediately following the assert
event, replace the power supply. If
the sensor name is PSU Fault 1,
replace PSU 1. If the sensor name is
PSU Fault 2, replace PSU 2. Go to
“8335-GCA and 8335-GTA locations”
on page 111 to identify the physical
location and removal and
replacement procedure.
v Predictive Failure
v Power Supply Input Out of Range
But Present
If the sensor name is PSU Fault 1,
replace PSU 1. If the sensor name is
PSU Fault 2, replace PSU 2. Go to
“8335-GCA and 8335-GTA locations”
on page 111 to identify the physical
location and removal and
replacement procedure.
v Power Supply Input Lost or AC
DC
v Power Supply Input Lost Or Out
Of Range
Ensure that ac power is supplied to
the rack. Ensure that the system
power cords are plugged tightly into
both the power supply and the rack
PDU for both system power
supplies. Go to “8335-GCA and
8335-GTA locations” on page 111 to
identify the physical location and
removal and replacement procedure.
Configuration Error
Ensure that both power supplies are
securely seated in the system. Go to
“8335-GCA and 8335-GTA locations”
on page 111 to identify the physical
location and removal and
replacement procedure.
v Presence Detected
No service action is required.
v Power Supply Inactive
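The assert/deassert rule for the Power Supply Failure Detected event can be sketched as follows. This is a minimal illustration, not a service tool: it assumes that you have already extracted, in chronological order, the Power Supply Failure Detected events for one PSU Fault sensor, and it treats "immediately followed" as "the next event in that list".

```python
from typing import Sequence


def psu_needs_replacement(asserted_flags: Sequence[bool]) -> bool:
    """Decide whether a power supply should be replaced.

    asserted_flags holds the Power Supply Failure Detected events for one
    PSU Fault sensor in chronological order: True for an assert event,
    False for a deassert event (an assumed representation).
    """
    for index, asserted in enumerate(asserted_flags):
        if not asserted:
            continue
        has_next = index + 1 < len(asserted_flags)
        followed_by_deassert = has_next and not asserted_flags[index + 1]
        if not followed_by_deassert:
            # An assert with no matching deassert: replace the power supply.
            return True
    # Every assert was paired with a deassert: a power cycle occurred.
    return False
```

For example, the sequence assert, deassert returns False (power cycle, no service action), and a lone assert returns True (replace PSU 1 or PSU 2 according to the sensor name).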
CPU VDD Volt (0xCF)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
No service action is required.
CPU VDD Curr (0xD0)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
No service action is required.
BIOS Golden Side (0xD2)
None
Go to “Resolving a system firmware
boot failure” on page 4 and follow
the service action for a system event
log (SEL) with the value OEM record
c0 and OEM c0 specific log
information 3a1504xxxxxx.
BMC Golden Side (0xD3)
None
Go to “Resolving a system firmware
boot failure” on page 4 and follow
the service action for a system event
log (SEL) with the value OEM record
c0 and OEM c0 specific log
information 3a1504xxxxxx.
v Fan 1 (0xD4)
v Fan 2 (0xD5)
v Fan 3 (0xD6)
v Fan 4 (0xD7)
CurPwr Redundant (0xD8)
NxtPwr Redundant (0xD9)
Turbo Allowed (0xDA)
v Transition to Critical from less
Severe
v Transition to Non-recoverable from
less severe
v Transition to critical from
non-recoverable
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Device Inserted/Device Present
v Device Removed/Device Absent
v Transition to degraded
v Install error
v Redundancy lost
v Non-redundant insufficient
resources
v State Deasserted
v State Asserted
v State Deasserted
v State Asserted
v State Deasserted
v State Asserted
If the sensor name is Fan 1, replace
Fan 1. If the sensor name is Fan 2,
replace Fan 2. And so on. Go to
“8335-GCA and 8335-GTA locations”
on page 111 to identify the physical
location and removal and
replacement procedure.
No service action is required.
Ensure that all fans are seated
securely. Go to “8335-GCA and
8335-GTA locations” on page 111 to
identify the physical location and
removal and replacement procedure.
No service action is required.
No service action is required.
No service action is required.
Identifying a service action by using sensor and event information for the
8335-GTB
You can use the sensor and event information from the system event log (SEL) to determine a service
action to perform for the IBM Power System S822LC (8335-GTB).
If you have not done so already, complete “Identifying a service action by using system event logs” on
page 27. Then, use the following table to determine the service action to perform.
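The lookup that this procedure describes can be sketched as a small table keyed by sensor name and event description. The sketch is illustrative only: SERVICE_ACTIONS and find_service_action are hypothetical names, and the entries are a small subset of Table 22 with shortened service action text.

```python
from typing import Dict, Optional, Tuple

NO_ACTION = "No service action is required."

# Illustrative subset of Table 22; keys are (sensor name, event description).
SERVICE_ACTIONS: Dict[Tuple[str, str], str] = {
    ("Boot Count", "None"): NO_ACTION,
    ("Motherboard Flt", "State Deasserted"): NO_ACTION,
    ("Motherboard Flt", "State Asserted"): "Replace the system backplane.",
    ("FW Boot Progress", "System Firmware Progress"): NO_ACTION,
    ("Ambient Temp", "Upper Critical - going high"): (
        "Ensure that the room temperature meets the requirements and that "
        "no obstructions are blocking air flow to the system."
    ),
}


def find_service_action(sensor_name: str, event: str) -> Optional[str]:
    """Return the service action for a sensor and event, or None if the
    pair is not in this illustrative subset."""
    return SERVICE_ACTIONS.get((sensor_name.strip(), event.strip()))
```

A result of None means only that the pair is missing from this sketch; consult the full table before you decide on a service action.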
Table 22. Sensor information, event description, and service action for the 8335-GTB
Sensor name (Sensor ID)    Event description    Service action
Watchdog (0x00)
Host Status (0x04)    Unknown    Go to Getting fixes and update the
v Timer Expired
v Reserved1
v Reserved2
v Reserved3
v Reserved4
v Hard Reset
v Power Down
v Power Cycle
v Timer Interrupt
v S0/G0 “Working”
v S1 “Sleeping with system h/w &
processor context maintained”
v S2 “sleeping, processor context
lost”
v S3 “sleeping, processor & h/w
context lost, memory retained”
v S4 “non-volatile sleep / suspend-to
disk”
v S5 / G2: “soft-off”
v S4 / S5: “soft-off”
v G3 mechanical Off
v Sleeping in an S1/S2/S3 State
v G1: Sleeping
v S5: entered by override
v Legacy ON state
v Legacy OFF state
No service action is required.
SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that
a boot failed. Search for boot failure
SEL events that have a time stamp
close to the time stamp of this SEL
event. If events exist, go to
“Resolving a system firmware boot
failure” on page 4. If there are no
boot failure SEL events and the
system booted correctly, no service
action is required.
system firmware to the most recent
level of firmware that is available. If
this SEL event continues to be logged
each time you power on the system,
go to “Collecting diagnostic data” on
page 109. Then, go to “Contacting
IBM service and support” on page
110.
No service action is required.
FW Boot Progress (0x05)
v System Firmware Error
v System Firmware Hang
SEL events with OEM record c0 |
000e000 | 3a150xxxxxxx indicate that
a boot failed. Search for boot failure
SEL events that have a time stamp
close to the time stamp of this SEL
event. If events exist, go to
“Resolving a system firmware boot
failure” on page 4.
System Firmware Progress
No service action is required.
v OCC 1 Active (0x08)
v OCC 2 Active (0x09)
Device Disabled
If the sensor name is OCC 1 Active,
replace CPU 1. If the sensor name is
OCC 2 Active, replace CPU 2. Go to
“8335-GTB locations” on page 121 to
identify the physical location and
removal and replacement procedure.
v State Deasserted
No service action is required.
v Device Enabled
Ambient Temp (0x0A)
v Upper Critical - going low
No service action is required.
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Lower Critical - going low
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
Upper Critical - going high
Ensure that the room temperature
meets the requirements that are
specified for the system. Ensure that
no obstructions are blocking air flow
to the system.
v CPU1 Temp (0x0B)
v CPU2 Temp (0x0D)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical - going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Lower Critical - going low
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
v CPU Func 1 (0x0C)
v CPU Func 2 (0x0E)
v IERR
v Transition to Non-recoverable
v Predictive Failure
If the sensor name is CPU Func 1,
replace CPU 1. If the sensor name is
CPU Func 2, replace CPU 2. Go to
“8335-GTB locations” on page 121 to
identify the physical location and
removal and replacement procedure.
v Thermal Trip
No service action is required.
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup
Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled
v Terminator Presence Detected
v Processor Automatically Throttled
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from
More Severe
v Monitor
v Informational
All PGood (0x1C)
v Interlock Power Down
v Power Off Power Down
v Power Cycle
v 240VA Power Down
No service action is required.
v AC Lost
v Soft Power Control Failure
v Power Unit Failure Detected
v Predictive Failure
v Ensure that ac power is supplied
to the rack.
v Ensure that the system power
cords are plugged tightly into both
the power supply and the rack
power distribution unit (PDU) for
both system power supplies.
v Ensure that the system was not
powered off.
v Ensure that ac power is supplied
to the rack.
v Ensure that the power supply
cords are plugged tightly into the
power supplies and the rack PDU.
v Ensure that the system was not
powered off.
v Check for service action required
SEL events for the power supply
sensor. If any exist, follow the
service action that is specified in
“Identifying a service action by
using sensor and event information
for the 8335-GTB” on page 57.
v DIMM Func 1 (0x1E)
v DIMM Func 2 (0x1F)
v DIMM Func 3 (0x20)
v DIMM Func 4 (0x21)
v DIMM Func 5 (0x22)
v DIMM Func 6 (0x23)
v DIMM Func 7 (0x24)
v DIMM Func 8 (0x25)
v DIMM Func 9 (0x26)
v DIMM Func 10 (0x27)
v DIMM Func 11 (0x28)
v DIMM Func 12 (0x29)
v DIMM Func 13 (0x2A)
v DIMM Func 14 (0x2B)
v DIMM Func 15 (0x2C)
v DIMM Func 16 (0x2D)
v DIMM Func 17 (0x2E)
v DIMM Func 18 (0x2F)
v DIMM Func 19 (0x30)
v DIMM Func 20 (0x31)
v DIMM Func 21 (0x32)
v DIMM Func 22 (0x33)
v DIMM Func 23 (0x34)
v DIMM Func 24 (0x35)
v DIMM Func 25 (0x36)
v DIMM Func 26 (0x37)
v DIMM Func 27 (0x38)
v DIMM Func 28 (0x39)
v DIMM Func 29 (0x3A)
v DIMM Func 30 (0x3B)
v DIMM Func 31 (0x3C)
v DIMM Func 32 (0x3D)
v Memory Device Disabled
v Uncorrectable Memory Error
v Memory Scrub Failed
v State Deasserted
v Device Disabled
v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Correctable Memory Error
v Parity
v Correctable Memory Error Logging
Limit Reached
v Memory Automatically Throttled
v Critical Over temperature
v Presence Detected
v Spare
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from
More Severe
v Monitor
v Informational
v Transition to Non-recoverable
v Predictive Failure
No service action is required.
If the sensor name is DIMM Func 1,
replace DIMM 1. If the sensor name
is DIMM Func 2, replace DIMM 2.
And so on. Go to “8335-GTB
locations” on page 121 to identify the
physical location and removal and
replacement procedure.
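The "And so on" pattern for the DIMM Func sensors can be made explicit. In the sensor list for this table, the sensor IDs run consecutively from 0x1E (DIMM Func 1) to 0x3D (DIMM Func 32), so the DIMM number and sensor ID can be derived from the sensor name. The function name below is hypothetical, and the sketch assumes that the sensor name appears exactly as it is listed in the table.

```python
import re
from typing import Tuple

DIMM_FUNC_BASE_ID = 0x1E  # sensor ID of DIMM Func 1 in Table 22
DIMM_COUNT = 32           # DIMM Func 1 through DIMM Func 32


def dimm_for_sensor(sensor_name: str) -> Tuple[str, int]:
    """Map a 'DIMM Func N' sensor name to ('DIMM N', sensor ID)."""
    match = re.fullmatch(r"DIMM Func (\d+)", sensor_name.strip())
    if match is None:
        raise ValueError(f"not a DIMM Func sensor: {sensor_name!r}")
    number = int(match.group(1))
    if not 1 <= number <= DIMM_COUNT:
        raise ValueError(f"DIMM number out of range: {number}")
    return f"DIMM {number}", DIMM_FUNC_BASE_ID + number - 1
```

For example, DIMM Func 17 maps to DIMM 17 with sensor ID 0x2E. Use “8335-GTB locations” to find the physical slot.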
v DIMM Func 1 (0x1E)
v DIMM Func 2 (0x1F)
v DIMM Func 3 (0x20)
v DIMM Func 4 (0x21)
v DIMM Func 5 (0x22)
v DIMM Func 6 (0x23)
v DIMM Func 7 (0x24)
v DIMM Func 8 (0x25)
v DIMM Func 9 (0x26)
v DIMM Func 10 (0x27)
v DIMM Func 11 (0x28)
v DIMM Func 12 (0x29)
v DIMM Func 13 (0x2A)
v DIMM Func 14 (0x2B)
v DIMM Func 15 (0x2C)
v DIMM Func 16 (0x2D)
v DIMM Func 17 (0x2E)
v DIMM Func 18 (0x2F)
v DIMM Func 19 (0x30)
v DIMM Func 20 (0x31)
v DIMM Func 21 (0x32)
v DIMM Func 22 (0x33)
v DIMM Func 23 (0x34)
v DIMM Func 24 (0x35)
v DIMM Func 25 (0x36)
v DIMM Func 26 (0x37)
v DIMM Func 27 (0x38)
v DIMM Func 28 (0x39)
v DIMM Func 29 (0x3A)
v DIMM Func 30 (0x3B)
v DIMM Func 31 (0x3C)
v DIMM Func 32 (0x3D)
Configuration Error
Complete the following steps:
1. If the sensor name is DIMM Func
1, ensure that DIMM 1 is seated
properly. If the sensor name is
DIMM Func 2, ensure that DIMM
2 is seated properly. And so on.
2. If you recently installed or
replaced memory DIMMs, ensure
that the DIMMs are plugged in
the correct memory slots.
3. If the sensor name is DIMM Func
1, replace DIMM 1. If the sensor
name is DIMM Func 2, replace
DIMM 2. And so on. Go to
“8335-GTB locations” on page 121
to identify the physical location
and removal and replacement
procedure.
v CPU Core Func 1 (0x3E)
v CPU Core Func 2 (0x3F)
v CPU Core Func 3 (0x40)
v CPU Core Func 4 (0x41)
v CPU Core Func 5 (0x42)
v CPU Core Func 6 (0x43)
v CPU Core Func 7 (0x44)
v CPU Core Func 8 (0x45)
v CPU Core Func 9 (0x46)
v CPU Core Func 10 (0x47)
v CPU Core Func 11 (0x48)
v CPU Core Func 12 (0x49)
v IERR
v Transition to Non-recoverable
v Predictive Failure
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup
Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled
v Terminator Presence Detected
Replace system processor CPU 1. Go
to “8335-GTB locations” on page 121
to identify the physical location and
removal and replacement procedure.
No service action is required.
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Thermal Trip
v Processor Automatically Throttled
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from
More Severe
v Monitor
v Informational
v CPU Core Func 13 (0x4A)
v CPU Core Func 14 (0x4B)
v CPU Core Func 15 (0x4C)
v CPU Core Func 16 (0x4D)
v CPU Core Func 17 (0x4E)
v CPU Core Func 18 (0x4F)
v CPU Core Func 19 (0x50)
v CPU Core Func 20 (0x51)
v CPU Core Func 21 (0x52)
v CPU Core Func 22 (0x53)
v CPU Core Func 23 (0x54)
v CPU Core Func 24 (0x55)
v IERR
v Transition to Non-recoverable
v Predictive Failure
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup
Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled
v Terminator Presence Detected
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Thermal Trip
v Processor Automatically Throttled
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from
More Severe
v Monitor
v Informational
Replace system processor CPU 2. Go
to “8335-GTB locations” on page 121
to identify the physical location and
removal and replacement procedure.
No service action is required.
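The two CPU Core Func groups differ only in the processor that they point to: CPU Core Func 1 - 12 (sensor IDs 0x3E - 0x49) belong to CPU 1, and CPU Core Func 13 - 24 (sensor IDs 0x4A - 0x55) belong to CPU 2. A minimal sketch of that mapping follows; the function name is hypothetical, and the sketch identifies only the processor, not whether the logged event requires a replacement.

```python
from typing import Tuple

CORE_FUNC_FIRST_ID = 0x3E  # CPU Core Func 1
CORE_FUNC_LAST_ID = 0x55   # CPU Core Func 24
CORES_PER_CPU = 12         # Func 1-12 map to CPU 1, Func 13-24 to CPU 2


def cpu_for_core_sensor(sensor_id: int) -> Tuple[int, str]:
    """Map a CPU Core Func sensor ID to (core sensor number, processor)."""
    if not CORE_FUNC_FIRST_ID <= sensor_id <= CORE_FUNC_LAST_ID:
        raise ValueError(f"not a CPU Core Func sensor ID: {sensor_id:#04x}")
    core = sensor_id - CORE_FUNC_FIRST_ID + 1
    processor = "CPU 1" if core <= CORES_PER_CPU else "CPU 2"
    return core, processor
```

For example, sensor ID 0x4A is CPU Core Func 13, which belongs to CPU 2.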
v Mem Buf Func 1 (0x56)
v Mem Buf Func 2 (0x57)
v Mem Buf Func 3 (0x58)
v Mem Buf Func 4 (0x59)
v Mem Buf Func 5 (0x5A)
v Mem Buf Func 6 (0x5B)
v Mem Buf Func 7 (0x5C)
v Mem Buf Func 8 (0x5D)
v Uncorrectable Memory Error
v Memory Device Disabled
v State Deasserted
v Device Disabled
v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
No service action is required.
Non-recoverable
v Correctable Memory Error
v Parity
v Memory Scrub Failed
v Correctable Memory Error Logging
Limit Reached
v Memory Automatically Throttled
v Critical Over temperature
v Presence Detected
v Spare
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from
More Severe
v Monitor
v Informational
v Configuration Error
v Transition to Non-recoverable
v Predictive Failure
If the sensor name is Mem Buf Func
1, replace memory riser 1. If the
sensor name is Mem Buf Func 2,
replace memory riser 2. And so on.
Go to “8335-GTB locations” on page
121 to identify the physical location
and removal and replacement
procedure.
Boot Count (0x5F)
None
No service action is required.
Motherboard Flt (0x60)
State Deasserted
No service action is required.
State Asserted
Replace the system backplane. Go to
“8335-GTB locations” on page 121 to
identify the physical location and
removal and replacement procedure.
System Event (0x61)
Undetermined system hardware failure
v System Reconfigured
v OEM System boot event
v Entry added to auxiliary log
v PEF Action
v Timestamp Clock Sync
v Transition State Active
v Transition State Idle
v Transition State Busy
Activate Pwr Lt (0x62)
None
No service action is required.
v Ref Clock Fault (0x63)
v PCI Clock Fault (0x64)
v State Deasserted
v State Asserted
Go to “Collecting diagnostic data” on
page 109. Then, go to “Contacting
IBM service and support” on page
110.
No service action is required.
No service action is required.
v DIMM1 Temp (0x69)
v DIMM2 Temp (0x6A)
v DIMM3 Temp (0x6B)
v DIMM4 Temp (0x6C)
v DIMM5 Temp (0x6D)
v DIMM6 Temp (0x6E)
v DIMM7 Temp (0x6F)
v DIMM8 Temp (0x70)
v DIMM9 Temp (0x71)
v DIMM10 Temp (0x72)
v DIMM11 Temp (0x73)
v DIMM12 Temp (0x74)
v DIMM13 Temp (0x75)
v DIMM14 Temp (0x76)
v DIMM15 Temp (0x77)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
v DIMM16 Temp (0x78)
v DIMM17 Temp (0x79)
v DIMM18 Temp (0x7A)
v DIMM19 Temp (0x7B)
v DIMM20 Temp (0x7C)
v DIMM21 Temp (0x7D)
v DIMM22 Temp (0x7E)
v DIMM23 Temp (0x7F)
v DIMM24 Temp (0x80)
v DIMM25 Temp (0x81)
v DIMM26 Temp (0x82)
v DIMM27 Temp (0x83)
v DIMM28 Temp (0x84)
v DIMM29 Temp (0x85)
v DIMM30 Temp (0x86)
v DIMM31 Temp (0x87)
v DIMM32 Temp (0x88)
v CPU Core Temp 1 (0x89)
v CPU Core Temp 2 (0x8A)
v CPU Core Temp 3 (0x8B)
v CPU Core Temp 4 (0x8C)
v CPU Core Temp 5 (0x8D)
v CPU Core Temp 6 (0x8E)
v CPU Core Temp 7 (0x8F)
v CPU Core Temp 8 (0x90)
v CPU Core Temp 9 (0x91)
v CPU Core Temp 10 (0x92)
v CPU Core Temp 11 (0x93)
v CPU Core Temp 12 (0x94)
v CPU Core Temp 13 (0x95)
v CPU Core Temp 14 (0x96)
v CPU Core Temp 15 (0x97)
v CPU Core Temp 16 (0x98)
v CPU Core Temp 17 (0x99)
v CPU Core Temp 18 (0x9A)
v CPU Core Temp 19 (0x9B)
v CPU Core Temp 20 (0x9C)
v CPU Core Temp 21 (0x9D)
v CPU Core Temp 22 (0x9E)
v CPU Core Temp 23 (0x9F)
v CPU Core Temp 24 (0xA0)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
No service action is required.
v System Power (0xA1)
v Proc0 Power (0xA2)
v Proc1 Power (0xA3)
v PCIE Proc0 Pwr (0xA6)
v PCIE Proc1 Power (0xA7)
v GPU Power (0xAA)
v Mem Cache Power (0xAB)
v Mem Proc0 Pwr (0xAC)
v Mem Proc1 Pwr (0xAD)
v Fan Power (0xB0)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
No service action is required.
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v TOD Clock Fault (0xB1)
v APSS Fault (0xB2)
v State Deasserted
v State Asserted
No service action is required.
PS Derating Fac (0xB4)
None
No service action is required.
OS Boot (0xB5)
v Installation aborted
v Installation failed
Ensure that the operating system
boot image is loaded. Ensure that the
disk drive or solid-state drive is
ready. Reload the operating system
boot image.
v A: boot completed
No service action is required.
v C: boot completed
v PXE boot completed
v Diagnostic boot completed
v CD-ROM boot completed
v ROM boot completed
v Boot completed - device not
specified
v Installation started
v Installation completed
PCI (0xB6)
v State Deasserted
No service action is required.
v State Asserted
v GPU Func 1 (0xB8)
v GPU Func 2 (0xB9)
v GPU Func 3 (0xBA)
v GPU Func 4 (0xBB)
v GPU Temp 1 (0xBC)
v GPU Temp 2 (0xBD)
v GPU Temp 3 (0xBE)
v GPU Temp 4 (0xBF)
v Mem Buf Temp 1 (0xC0)
v Mem Buf Temp 2 (0xC1)
v Mem Buf Temp 3 (0xC2)
v Mem Buf Temp 4 (0xC3)
v Mem Buf Temp 5 (0xC4)
v Mem Buf Temp 6 (0xC5)
v Mem Buf Temp 7 (0xC6)
v Mem Buf Temp 8 (0xC7)
v Uncorrectable Memory Error
v Parity
v Memory Scrub Failed
v Memory Device Disabled
v Configuration Error
v Memory Automatically Throttled
v Correctable Memory Error
v Parity
v Correctable Memory Error Logging
Limit Reached
v Presence Detected
v Spare
v Critical Over temperature
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
If the sensor name is GPU Func 1,
replace GPU 1. If the sensor name is
GPU Func 2, replace GPU 2. And so
on. Go to “8335-GTB locations” on
page 121 to identify the physical
location and removal and
replacement procedure.
No service action is required.
No service action is required.
No service action is required.
v CPU Diode 1 (0xC8)
v CPU Diode 2 (0xCB)
v Lower Non-critical – going low
v Lower Non-critical – going high
No service action is required.
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
Checkstop (0xC9)
IERR
If this event immediately precedes a
system power off, no service action is
required. Otherwise, search for SEL
events that meet the following
criteria:
v The event has a time stamp close
to the time stamp of this event.
v A service action keyword is
present. For a list of service action
keywords, see “Identifying service
action keywords in system event
logs” on page 36.
v Asserted is in the description.
If you found a SEL event that
matches the criteria, perform the
service action that is indicated in this
table for the SEL event. Otherwise, go
to “Collecting diagnostic data” on
page 109. Then, go to “Contacting
IBM service and support” on page
110.
v Thermal Trip
v Configuration Error
v Processor Automatically Throttled
v Correctable Machine Check Error
v Processor Presence Detected
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup
Initialization Failure
v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled
v Terminator Presence Detected
v Machine Check Exception
No service action is required.
Go to “Collecting diagnostic data” on
page 109. Then, go to “Contacting
IBM service and support” on page
110.
v PSU Fault 1 (0xCD)
v PSU Fault 2 (0xCE)
Power Supply Failure Detected
An assert event immediately
followed by a deassert event
indicates that a power cycle of the
system occurred. No service action is
required. If there is no deassert event
immediately following the assert
event, replace the power supply. If
the sensor name is PSU Fault 1,
replace PSU 1. If the sensor name is
PSU Fault 2, replace PSU 2. Go to
“8335-GTB locations” on page 121 to
identify the physical location and
removal and replacement procedure.
v Predictive Failure
v Power Supply Input Out of Range
But Present
If the sensor name is PSU Fault 1,
replace PSU 1. If the sensor name is
PSU Fault 2, replace PSU 2. Go to
“8335-GTB locations” on page 121 to
identify the physical location and
removal and replacement procedure.
v Power Supply Input Lost or AC
DC
v Power Supply Input Lost Or Out
Of Range
Ensure that ac power is supplied to
the rack. Ensure that the system
power cords are plugged tightly into
both the power supply and the rack
PDU unit for both system power
supplies. Go to “8335-GTB locations”
on page 121 to identify the physical
location and removal and
replacement procedure.
Configuration ErrorEnsure that both power supplies are
securely seated in the system. Go to
“8335-GTB locations” on page 121 to
identify the physical location and
removal and replacement procedure.
v Presence Detected
No service action is required.
v Power Supply Inactive
Sensor name (Sensor ID): CPU VDD Volt (0xCF)
Event description:
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
Service action: No service action is required.
Sensor name (Sensor ID): CPU VDD Curr (0xD0)
Event description:
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): BIOS Golden Side (0xD2)
Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.

Sensor name (Sensor ID): BMC Golden Side (0xD3)
Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.
v Fan 1 (0xD4)
v Fan 2 (0xD5)
v Fan 3 (0xD6)
v Fan 4 (0xD7)
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going low
v Lower Critical – going high
No service action is required.
v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Device Inserted/Device Present
v Device Removed/Device Absent
v Transition to degraded
v Install error
v Redundancy lost
Ensure that all fans are seated
securely. Go to “8335-GTB locations”
on page 121 to identify the physical
location and removal and
replacement procedure.
v Non-redundant insufficient
resources
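Sensor name (Sensor ID): CurPwr Redundant (0xD8)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): NxtPwr Redundant (0xD9)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): Turbo Allowed (0xDA)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.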
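Sensor name (Sensor ID):
v Freq Limit OT 1 (0xDB)
v Freq Limit OT 2 (0xDF)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
v Freq Limit Pwr 1 (0xDC)
v Freq Limit Pwr 2 (0xE0)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
v Mem Thrtl OT 1 (0xDD)
v Mem Thrtl OT 2 (0xE1)
Event description:
v State Deasserted
v State Asserted
Service action: No service action is required.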
Sensor name (Sensor ID):
v Quick Pwr Drop 1 (0xDE)
v Quick Pwr Drop 2 (0xE2)

Event description: State Deasserted
Service action: No service action is required.

Event description: State Asserted
Service action:
v Ensure that ac power is supplied to the rack.
v Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU unit.
v Check for service action required SEL events for the power supply sensor. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57.

Sensor name (Sensor ID): Water Cooled (0xE3)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID): CPU 1 VDD Temp (0xE4)
Event description: Upper Critical - going high
Service action: If the system is a water-cooled system, go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26. If the system is an air-cooled system, ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.

Sensor name (Sensor ID): CPU 2 VDD Temp (0xE5)
Event description: Upper Critical - going high
Service action: If the system is a water-cooled system, go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26. If the system is an air-cooled system, ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.
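
The PSU Fault rule in Table 22 (an assert event immediately followed by a deassert event indicates a power cycle; an assert event with no following deassert event means that the power supply should be replaced) might be sketched as follows. This is an illustrative sketch only, not an IBM tool. The SelEvent record and the psu_failure_action function are hypothetical names, and "immediately followed" is assumed to mean the very next entry in the event list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelEvent:
    """Hypothetical, simplified view of one SEL entry."""
    sensor: str       # for example "PSU Fault 1"
    description: str  # for example "Power Supply Failure Detected"
    asserted: bool    # True for an assert event, False for a deassert event


def psu_failure_action(events: List[SelEvent], index: int) -> Optional[str]:
    """Return the service action for the PSU failure assert event at events[index].

    Returns None when the event is not a "Power Supply Failure Detected"
    assert event on a PSU Fault sensor.
    """
    event = events[index]
    if not (event.sensor.startswith("PSU Fault ")
            and event.description == "Power Supply Failure Detected"
            and event.asserted):
        return None

    # Assumption: "immediately followed" means the next entry in the log.
    following = events[index + 1] if index + 1 < len(events) else None
    if (following is not None
            and following.sensor == event.sensor
            and following.description == event.description
            and not following.asserted):
        return "No service action is required (power cycle)."

    # PSU Fault 1 maps to PSU 1, and PSU Fault 2 maps to PSU 2.
    psu_number = event.sensor.rsplit(" ", 1)[1]
    return f"Replace PSU {psu_number}."
```

For example, an assert event on PSU Fault 2 that is the last entry in the list yields "Replace PSU 2.", while an assert and deassert pair on PSU Fault 1 yields the no-action result.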
Identifying a service action by using sensor and event information for the 8348-21C
You can use the sensor and event information from the system event log to determine a service action to
perform for the IBM Power System S812LC (8348-21C).
If you have not done so already, complete “Identifying a service action by using system event logs” on
page 27. Then, use the following table to determine the service action to perform.
Table 23. Sensor information, event description, and service action for the 8348-21C
Sensor name (Sensor ID) | Event description | Service action
Watchdog (0x00)
v Timer Expired
No service action is required.
v Reserved1
v Reserved2
v Reserved3
v Reserved4
v Hard Reset
v Power Down
v Power Cycle
v Timer Interrupt
SEL events with OEM record c0 |
000e000 | 3a150xxxxxxx indicate that
a boot failed. Search for boot failure
SEL events that have a time stamp in
close proximity to the time stamp of
this SEL event. If events exist, go to
“Resolving a system firmware boot
failure” on page 4. If there are no
boot failure SEL events and the
system booted correctly, no service
action is required.
Sensor name (Sensor ID): Host Status (0x04)

Event description: Unknown
Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Event description:
v S0/G0 “Working”
v S1 “Sleeping with system h/w & processor context maintained”
v S2 “sleeping, processor context lost”
v S3 “sleeping, processor & h/w context lost, memory retained”
v S4 “non-volatile sleep / suspend-to-disk”
v S5 / G2: “soft-off”
v S4 / S5: “soft-off”
v G3 mechanical Off
v Sleeping in an S1/S2/S3 State
v G1: Sleeping
v S5: entered by override
v Legacy ON state
v Legacy OFF state
Service action: No service action is required.
Sensor name (Sensor ID): FW Boot Progress (0x05)

Event description:
v System Firmware Error
v System Firmware Hang
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.

Event description: System Firmware Progress
Service action: No service action is required.

Sensor name (Sensor ID): OCC Active (0x08)

Event description: Device Disabled
Service action: Replace the system processor. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.

Event description:
v State Deasserted
v Device Enabled
Service action: No service action is required.

Sensor name (Sensor ID): Ambient Temp (0x0A)

Event description:
v Upper Critical - going low
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Lower Critical - going low
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
Service action: No service action is required.

Event description: Upper Critical - going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.
Sensor name (Sensor ID): CPU Temp (0x64)
Event description:
v Lower Non-critical – going low
v Lower Non-critical – going high
v Lower Critical - going low
v Lower Critical – going high
v Lower Non-recoverable – going low
v Lower Non-recoverable – going high
v Upper Non-critical – going low
v Upper Non-critical – going high
v Upper Critical - going low
v Upper Critical - going high
v Upper Non-recoverable – going low
v Upper Non-recoverable – going high
Service action: No service action is required.
Sensor name (Sensor ID): CPU Func (0x4E)

Event description:
v IERR
v Transition to Non-recoverable
v Predictive Failure
v Processor Disabled
v Thermal Trip
v FRB1 BIST Failure
v FRB2 Hang In POST Failure
v FRB3 Processor Startup Initialization Failure
v Configuration Error
v SMBIOS Uncorrectable CPU Complex Error
v Terminator Presence Detected
v Processor Automatically Throttled
v Machine Check Exception
v Correctable Machine Check Error
v State Deasserted
v Device Disabled
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
Service action: Replace the system processor. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.

Event description:
v Processor Presence Detected
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from More Severe
v Monitor
v Informational
Service action: No service action is required.
All PGood (0x1C)
v Interlock Power Down
No service action is required.
v Power Off Power Down
v Power Cycle
v 240VA Power Down
v AC Lost
v Soft Power Control Failure
v Power Unit Failure Detected
v Predictive Failure
v Ensure that ac power is supplied
to the rack.
v Ensure that the system power
cords are plugged tightly into both
the power supply and the rack
power distribution unit (PDU) for
both system power supplies.
v Ensure that the system was not
powered off.
v Ensure that ac power is supplied
to the rack.
v Ensure that the power supply
cords are plugged tightly into the
power supplies and the rack PDU
unit.
v Ensure that the system was not
powered off.
v Check for service action required
SEL events for the power supply
sensor. If any exist, follow the
service action that is specified in
“Identifying a service action by
using sensor and event information
for the 8348-21C” on page 78.
v DIMM Func 0 (0x1E)
v DIMM Func 1 (0x1F)
v DIMM Func 2 (0x20)
v DIMM Func 3 (0x21)
v DIMM Func 4 (0x22)
v DIMM Func 5 (0x23)
v DIMM Func 6 (0x24)
v DIMM Func 7 (0x25)
v DIMM Func 8 (0x26)
v DIMM Func 9 (0x27)
v DIMM Func 10 (0x28)
v DIMM Func 11 (0x29)
v DIMM Func 12 (0x2A)
v DIMM Func 13 (0x2B)
v DIMM Func 14 (0x2C)
v DIMM Func 15 (0x2D)
v DIMM Func 16 (0x2E)
v DIMM Func 17 (0x2F)
v DIMM Func 18 (0x30)
v DIMM Func 19 (0x31)
v DIMM Func 20 (0x32)
v DIMM Func 21 (0x33)
v DIMM Func 22 (0x34)
v DIMM Func 23 (0x35)
v DIMM Func 24 (0x36)
v DIMM Func 25 (0x37)
v DIMM Func 26 (0x38)
v DIMM Func 27 (0x39)
v DIMM Func 28 (0x3A)
v DIMM Func 29 (0x3B)
v DIMM Func 30 (0x3C)
v DIMM Func 31 (0x3D)
v Memory Device Disabled
v Uncorrectable Memory Error
v Memory Scrub Failed
v State Deasserted
v Device Disabled
v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Correctable Memory Error
v Parity
v Correctable Memory Error Logging
Limit Reached
v Memory Automatically Throttled
v Critical Over temperature
v Presence Detected
v Spare
v State Asserted
v Device Enabled
v Transition to OK
v Transition to Non-Critical from OK
v Transition to Non-Critical from
More Severe
v Monitor
v Informational
v Transition to Non-recoverable
v Predictive Failure
No service action is required.
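Replace the DIMM whose number matches the sensor name. For example, if the sensor name is DIMM Func 0, replace DIMM 0; if the sensor name is DIMM Func 1, replace DIMM 1; continue in the same way through DIMM Func 31 and DIMM 31. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.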
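
The DIMM Func sensor list in Table 23 follows a regular pattern: DIMM Func 0 has sensor ID 0x1E, and each subsequent sensor ID is one higher, up to DIMM Func 31 at 0x3D. The following sketch shows how a sensor name could be translated to the DIMM number and the sensor ID. It is illustrative only; the function name dimm_for_sensor and the constants are hypothetical and are not part of any IBM tool.

```python
import re
from typing import Tuple

DIMM_FUNC_BASE_ID = 0x1E  # sensor ID of DIMM Func 0 in Table 23
DIMM_COUNT = 32           # DIMM Func 0 through DIMM Func 31


def dimm_for_sensor(sensor_name: str) -> Tuple[int, int]:
    """Return (DIMM number, sensor ID) for a name such as "DIMM Func 12"."""
    match = re.fullmatch(r"DIMM Func (\d+)", sensor_name.strip())
    if match is None:
        raise ValueError(f"not a DIMM Func sensor name: {sensor_name!r}")

    number = int(match.group(1))
    if not 0 <= number < DIMM_COUNT:
        raise ValueError(f"DIMM number out of range: {number}")

    # Sensor IDs are consecutive, so the ID is the base ID plus the DIMM number.
    return number, DIMM_FUNC_BASE_ID + number
```

For example, dimm_for_sensor("DIMM Func 12") returns the DIMM number 12 and the sensor ID 0x2A, which matches the entry for DIMM Func 12 in the table.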