IBM Power System 8335-GCA, Power System S812LC, Power System 8335-GTB, Power System S822LC, Power System 8348-21C User Manual

Power Systems
Problem analysis, system parts, and locations for the IBM Power System S822LC (8335-GCA, 8335-GTA, and 8335-GTB), and IBM Power System S812LC (8348-21C)
IBM
Note
This edition applies to IBM Power Systems™ servers that contain the POWER8® processor and to all associated models.
© Copyright IBM Corporation 2015, 2019.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Safety notices ................................. v
Beginning troubleshooting and problem analysis .................. 1
Determining the problem analysis procedure to perform..................... 1
Resolving a BMC access problem ............................ 2
Resolving a power problem .............................. 3
Resolving a system firmware boot failure.......................... 4
Resolving a VGA monitor problem ............................ 8
Resolving an operating system boot failure ......................... 9
Resolving a sensor indicator problem ........................... 11
Resolving a hardware problem ............................. 12
Resolving a GPU, PCIe adapter, or device problem ...................... 13
Resolving a RAID adapter problem .......................... 14
Resolving a network adapter problem ......................... 15
Resolving a graphics processing unit problem ....................... 16
Resolving an NVMe Flash adapter problem ....................... 19
Resolving a storage device problem .......................... 20
Identifying the location of the PCIe adapter by using the slot number ............... 21
Identifying the location of the GPU .......................... 22
Identifying the location of the NVMe Flash adapter ..................... 23
Identifying the location of the storage device ....................... 24
User guides for GPUs and PCIe adapters ........................ 25
Resolving an over temperature problem for a water-cooled 8335-GTB system ............. 26
Identifying a service action .............................. 27
Identifying a service action by using system event logs .................... 27
Identifying service action keywords in system event logs ................... 36
Identifying a service action by using sensor and event information ................ 37
Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA ... 37
Identifying a service action by using sensor and event information for the 8335-GTB ......... 57
Identifying a service action by using sensor and event information for the 8348-21C ......... 78
Isolation procedures ................................ 96
EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure ................. 96
EPUB_PRC_SP_CODE isolation procedure ........................ 97
EPUB_PRC_PHYP_CODE isolation procedure ....................... 97
EPUB_PRC_ALL_PROCS isolation procedure ....................... 98
EPUB_PRC_ALL_MEMCRDS isolation procedure...................... 98
EPUB_PRC_LVL_SUPPORT isolation procedure ...................... 99
EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure ................ 100
EPUB_PRC_FSI_PATH isolation procedure ....................... 100
EPUB_PRC_PROC_AB_BUS isolation procedure...................... 101
EPUB_PRC_PROC_XYZ_BUS isolation procedure ..................... 101
EPUB_PRC_EIBUS_ERROR isolation procedure ...................... 102
EPUB_PRC_POWER_ERROR isolation procedure ..................... 103
EPUB_PRC_MEMORY_UE isolation procedure ...................... 104
EPUB_PRC_HB_CODE isolation procedure ....................... 104
EPUB_PRC_TOD_CLOCK_ERR isolation procedure .................... 106
EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure .................. 106
EPUB_PRC_GPU_ISOLATION_PROCEDURE isolation procedure ................ 107
Verifying a repair ................................. 108
Collecting diagnostic data .............................. 109
Contacting IBM service and support ........................... 110
Finding parts and locations .......................... 111
8335-GCA and 8335-GTA locations ........................... 111
8335-GCA and 8335-GTA parts ............................ 115
Finding parts and locations .......................... 121
8335-GTB locations ................................ 121
8335-GTB parts.................................. 125
Finding parts and locations .......................... 133
8348-21C locations................................. 133
8348-21C parts .................................. 138
Notices ................................... 145
Accessibility features for IBM Power Systems servers ..................... 146
Privacy policy considerations ............................. 147
Trademarks ................................... 148
Electronic emission notices .............................. 148
Class A Notices................................. 148
Class B Notices ................................. 152
Terms and conditions................................ 155

Safety notices

Safety notices may be printed throughout this guide:
• DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.
• CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.
• Attention notices call attention to the possibility of damage to a program, device, system, or data.
World Trade safety information
Several countries require the safety information contained in product publications to be presented in their national languages. If this requirement applies to your country, safety information documentation is included in the publications package (such as in printed documentation, on DVD, or as part of the product) shipped with the product. The documentation contains the safety information in your national language with references to the U.S. English source. Before using a U.S. English publication to install, operate, or service this product, you must first become familiar with the related safety information documentation. You should also refer to the safety information documentation any time you do not clearly understand any safety information in the U.S. English publications.
Replacement or additional copies of safety information documentation can be obtained by calling the IBM Hotline at 1-800-300-8751.
German safety information
Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der Bildschirmarbeitsverordnung geeignet.
Laser safety information
IBM® servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.
Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.
DANGER: When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:
• If IBM supplied the power cord(s), connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.
• Do not open or service any power supply assembly.
• Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this product during an electrical storm.
• The product might be equipped with multiple power cords. To remove all hazardous voltages, disconnect all power cords.
  – For AC power, disconnect all power cords from their AC power source.
  – For racks with a DC power distribution panel (PDP), disconnect the customer’s DC power source to the PDP.
• When connecting power to the product ensure all power cables are properly connected.
  – For racks with AC power, connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system rating plate.
  – For racks with a DC power distribution panel (PDP), connect the customer’s DC power source to the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power return wiring.
• Connect any equipment that will be attached to this product to properly wired outlets.
• When possible, use one hand only to connect or disconnect signal cables.
• Never turn on any equipment when there is evidence of fire, water, or structural damage.
• Do not attempt to switch on power to the machine until all possible unsafe conditions are corrected.
• Assume that an electrical safety hazard is present. Perform all continuity, grounding, and power checks specified during the subsystem installation procedures to ensure that the machine meets safety requirements.
• Do not continue with the inspection if any unsafe conditions are present.
• Before you open the device covers, unless instructed otherwise in the installation and configuration procedures: Disconnect the attached AC power cords, turn off the applicable circuit breakers located in the rack power distribution panel (PDP), and disconnect any telecommunications systems, networks, and modems.
DANGER:
• Connect and disconnect cables as described in the following procedures when installing, moving, or opening covers on this product or attached devices.
To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. For AC power, remove the power cords from the outlets.
3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the PDP and remove the power from the Customer's DC power source.
4. Remove the signal cables from the connectors.
5. Remove all cables from the devices.
To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. For AC power, attach the power cords to the outlets.
5. For racks with a DC power distribution panel (PDP), restore the power from the Customer's DC power source and turn on the circuit breakers located in the PDP.
6. Turn on the devices.
Sharp edges, corners and joints may be present in and around the system. Use care when handling equipment to avoid cuts, scrapes and pinching. (D005)
(R001 part 1 of 2):
DANGER: Observe the following precautions when working on or around your IT rack system:
• Heavy equipment–personal injury or equipment damage might result if mishandled.
• Always lower the leveling pads on the rack cabinet.
• Always install stabilizer brackets on the rack cabinet.
• To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices in the bottom of the rack cabinet. Always install servers and optional devices starting from the bottom of the rack cabinet.
• Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices. In addition, do not lean on rack mounted devices and do not use them to stabilize your body position (for example, when working from a ladder).
• Each rack cabinet might have more than one power cord.
  – For AC powered racks, be sure to disconnect all power cords in the rack cabinet when directed to disconnect power during servicing.
  – For racks with a DC power distribution panel (PDP), turn off the circuit breaker that controls the power to the system unit(s), or disconnect the customer’s DC power source, when directed to disconnect power during servicing.
• Connect all devices installed in a rack cabinet to power devices installed in the same rack cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device installed in a different rack cabinet.
• An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock.
(R001 part 2 of 2):
CAUTION:
• Do not install a unit in a rack where the internal rack ambient temperatures will exceed the manufacturer's recommended ambient temperature for all your rack-mounted devices.
• Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
• Consideration should be given to the connection of the equipment to the supply circuit so that overloading of the circuits does not compromise the supply wiring or overcurrent protection. To provide the correct power connection to a rack, refer to the rating labels located on the equipment in the rack to determine the total power requirement of the supply circuit.
• (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets are not attached to the rack. Do not pull out more than one drawer at a time. The rack might become unstable if you pull out more than one drawer at a time.
• (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless specified by the manufacturer. Attempting to move the drawer partially or completely out of the rack might cause the rack to become unstable or cause the drawer to fall out of the rack.
CAUTION: Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building.
• Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you received it. If this configuration is not known, you must observe the following precautions:
  – Remove all devices in the 32U position (compliance ID RACK-001) or 22U (compliance ID RR001) and above.
  – Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
  – Ensure that there are little-to-no empty U-levels between devices installed in the rack cabinet below the 32U (compliance ID RACK-001) or 22U (compliance ID RR001) level, unless the received configuration specifically allowed it.
• If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from the suite.
• If the rack cabinet you are relocating was supplied with removable outriggers, they must be reinstalled before the cabinet is relocated.
• Inspect the route that you plan to take to eliminate potential hazards.
• Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
• Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
• Ensure that all devices, shelves, drawers, doors, and cables are secure.
• Ensure that the four leveling pads are raised to their highest position.
• Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
• Do not use a ramp inclined at more than 10 degrees.
• When the rack cabinet is in the new location, complete the following steps:
  – Lower the four leveling pads.
  – Install stabilizer brackets on the rack cabinet.
  – If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest position to the highest position.
• If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent. Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the pallet.
(R002)
(L001)
DANGER: Hazardous voltage, current, or energy levels are present inside any component that has this label attached. Do not open any cover or barrier that contains this label. (L001)
(L002)
DANGER: Rack-mounted devices are not to be used as shelves or work spaces. (L002)
(L003)
DANGER: Multiple power cords. The product might be equipped with multiple AC power cords or multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and power cables. (L003)
(L007)
CAUTION: A hot surface nearby. (L007)
(L008)
CAUTION: Hazardous moving parts nearby. (L008)
All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser product. Consult the label on each part for laser certification numbers and approval information.
CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:
• Do not remove the covers. Removing the covers of the laser product could result in exposure to hazardous laser radiation. There are no serviceable parts inside the device.
• Use of the controls or adjustments or performance of procedures other than those specified herein might result in hazardous radiation exposure.
(C026)
CAUTION: Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. Although shining light into one end and looking into the other end of a disconnected optical fiber to verify the continuity of optic fibers may not injure the eye, this procedure is potentially dangerous. Therefore, verifying the continuity of optical fibers by shining light into one end and looking at the other end is not recommended. To verify continuity of a fiber optic cable, use an optical light source and power meter. (C027)
CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)
CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following information: Laser radiation when open. Do not stare into the beam, do not view directly with optical instruments, and avoid direct exposure to the beam. (C030)
CAUTION: The battery contains lithium. To avoid possible explosion, do not burn or charge the battery.
Do Not:
• Throw or immerse into water
• Heat to more than 100°C (212°F)
• Repair or disassemble
Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local regulations. In the United States, IBM has a process for the collection of this battery. For information, call 1-800-426-4333. Have the IBM part number for the battery unit available when you call. (C003)
CAUTION: Regarding IBM provided VENDOR LIFT TOOL:
• Operation of LIFT TOOL by authorized personnel only.
• LIFT TOOL intended for use to assist, lift, install, remove units (load) up into rack elevations. It is not to be used loaded transporting over major ramps nor as a replacement for such designated tools like pallet jacks, walkies, fork trucks and such related relocation practices. When this is not practicable, specially trained persons or services must be used (for instance, riggers or movers).
• Read and completely understand the contents of LIFT TOOL operator's manual before using. Failure to read, understand, obey safety rules, and follow instructions may result in property damage and/or personal injury. If there are questions, contact the vendor's service and support. Local paper manual must remain with machine in provided storage sleeve area. Latest revision manual available on vendor's web site.
• Test verify stabilizer brake function before each use. Do not over-force moving or rolling the LIFT TOOL with stabilizer brake engaged.
• Do not move LIFT TOOL while platform is raised, except for minor positioning.
• Do not exceed rated load capacity. See LOAD CAPACITY CHART regarding maximum loads at center versus edge of extended platform.
• Only raise load if properly centered on platform. Do not place more than 200 lb (91 kg) on edge of sliding platform shelf also considering the load's center of mass/gravity (CoG).
• Do not corner load the platform tilt riser accessory option. Secure platform riser tilt option to main shelf in all four (4x) locations with provided hardware only, prior to use. Load objects are designed to slide on/off smooth platforms without appreciable force, so take care not to push or lean. Keep riser tilt option flat at all times except for final minor adjustment when needed.
• Do not stand under overhanging load.
• Do not use on uneven surface, incline or decline (major ramps).
• Do not stack loads.
• Do not operate while under the influence of drugs or alcohol.
• Do not support ladder against LIFT TOOL.
• Tipping hazard. Do not push or lean against load with raised platform.
• Do not use as a personnel lifting platform or step. No riders.
• Do not stand on any part of lift. Not a step.
• Do not climb on mast.
• Do not operate a damaged or malfunctioning LIFT TOOL machine.
• Crush and pinch point hazard below platform. Only lower load in areas clear of personnel and obstructions. Keep hands and feet clear during operation.
• No Forks. Never lift or move bare LIFT TOOL MACHINE with pallet truck, jack or fork lift.
• Mast extends higher than platform. Be aware of ceiling height, cable trays, sprinklers, lights, and other overhead objects.
• Do not leave LIFT TOOL machine unattended with an elevated load.
• Watch and keep hands, fingers, and clothing clear when equipment is in motion.
• Turn Winch with hand power only. If winch handle cannot be cranked easily with one hand, it is probably over-loaded. Do not continue to turn winch past top or bottom of platform travel. Excessive unwinding will detach handle and damage cable. Always hold handle when lowering, unwinding. Always assure self that winch is holding load before releasing winch handle.
• A winch accident could cause serious injury. Not for moving humans. Make certain clicking sound is heard as the equipment is being raised. Be sure winch is locked in position before releasing handle. Read instruction page before operating this winch. Never allow winch to unwind freely. Freewheeling will cause uneven cable wrapping around winch drum, damage cable, and may cause serious injury. (C048)
Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS (Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
• Network telecommunications facilities
• Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.
The dc-powered system is intended to be installed in a common bonding network (CBN) as described in GR-1089-CORE.

Beginning troubleshooting and problem analysis

This information provides a starting point for analyzing problems.
This information is the starting point for diagnosing and repairing systems. From this point, you are guided to the appropriate information to help you diagnose problems, determine the appropriate repair action, and then complete the necessary steps to repair the system.
Note: Update the system firmware to the latest level before you start problem analysis. If you update the system firmware, you will have the latest available fixes and improvements for error handling, reporting, and isolation. For instructions about updating the system firmware, see Getting fixes.
What type of problem are you dealing with? | Problem analysis procedure
You do not know the type of problem. | Go to “Determining the problem analysis procedure to perform.”
A baseboard management controller (BMC) access problem occurred. | Go to “Resolving a BMC access problem” on page 2.
The system does not power on (the power button or the BMC power on command does not power on the system). | Go to “Resolving a power problem” on page 3.
A system firmware boot failure occurred (the system started but was not able to boot to the Petitboot menu). | Go to “Resolving a system firmware boot failure” on page 4.
A video graphics array (VGA) monitor problem occurred (the system started but video is not displayed on the monitor). | Go to “Resolving a VGA monitor problem” on page 8.
An operating system boot failure occurred (the system booted to the Petitboot menu but the operating system did not start). | Go to “Resolving an operating system boot failure” on page 9.
A BMC dashboard sensor is red. | Go to “Resolving a sensor indicator problem” on page 11.
A processor, memory, power, or cooling hardware failure occurred. | Go to “Resolving a hardware problem” on page 12.
Missing or faulty graphics processing unit (GPU), PCIe adapter, disk drive, or solid-state drive. | Go to “Resolving a GPU, PCIe adapter, or device problem.”

Determining the problem analysis procedure to perform

Learn how to identify the correct problem analysis procedure to perform.
To determine the correct problem analysis procedure to perform, complete the following steps:
1. After you apply power to the system, do the power supply LEDs display XXX and after 30 seconds the power button flashes?
   Yes: Continue with the next step.
   No: Go to “Resolving a power problem” on page 3.
2. Can you access the baseboard management controller (BMC) across the network?
   Yes: Continue with the next step.
   No: Go to “Resolving a BMC access problem.”
3. Can you boot the system to the Petitboot menu?
   Yes: Continue with the next step.
   No: Go to “Resolving a system firmware boot failure” on page 4.
4. Is video displayed on the video graphics array (VGA) monitor?
   Yes: Continue with the next step.
   No: Go to “Resolving a VGA monitor problem” on page 8.
5. Can you start the operating system?
   Yes: Continue with the next step.
   No: Go to “Resolving an operating system boot failure” on page 9.
6. On the BMC dashboard, are any sensors red?
   Yes: Go to “Resolving a sensor indicator problem” on page 11.
   No: Continue with the next step.
7. Go to “Resolving a hardware problem” on page 12. This ends the procedure.
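The seven steps above form a simple first-failure-wins decision chain. The following Python sketch is one way to express that chain, for example in a help-desk script. It is an illustration only: the Observations class, its field names, and the next_procedure function are hypothetical and are not part of any IBM tool. Only the order of the checks and the procedure titles come from the steps above.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Answers to the questions in steps 1 - 6 (hypothetical field names)."""
    power_leds_ok: bool      # step 1: power supply LEDs and power button behave as described
    bmc_reachable: bool      # step 2: BMC can be accessed across the network
    reaches_petitboot: bool  # step 3: system boots to the Petitboot menu
    vga_video_shown: bool    # step 4: video is displayed on the VGA monitor
    os_starts: bool          # step 5: operating system starts
    any_sensor_red: bool     # step 6: a BMC dashboard sensor is red


def next_procedure(obs: Observations) -> str:
    """Return the title of the procedure that the steps above lead to."""
    if not obs.power_leds_ok:
        return "Resolving a power problem"
    if not obs.bmc_reachable:
        return "Resolving a BMC access problem"
    if not obs.reaches_petitboot:
        return "Resolving a system firmware boot failure"
    if not obs.vga_video_shown:
        return "Resolving a VGA monitor problem"
    if not obs.os_starts:
        return "Resolving an operating system boot failure"
    if obs.any_sensor_red:
        return "Resolving a sensor indicator problem"
    # Step 7: every earlier check passed.
    return "Resolving a hardware problem"


if __name__ == "__main__":
    # Example: power and BMC are fine, but the system never reaches Petitboot.
    obs = Observations(True, True, False, True, True, False)
    print(next_procedure(obs))  # prints "Resolving a system firmware boot failure"
```

Because the checks run in the same order as the numbered steps, the first failing check decides the result, which matches how the procedure is read on paper.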

Resolving a BMC access problem

Learn how to identify the service action that is needed to resolve a baseboard management controller (BMC) access problem.
1. Ensure that the BMC password is not set to the default password. For information about changing the default password, see Logging on to the BMC GUI. Does the problem persist?
   Yes: Continue with the next step.
   No: This ends the procedure.
2. Are both ends of the network cable seated securely?
   Yes: Continue with the next step.
   No: Seat both ends of the cable securely. If the problem persists, continue with the next step.
3. Power off the system and disconnect all ac power cords for 30 seconds. Then, reconnect the ac power cords and power on the system. Does the BMC access problem persist?
   Yes: Continue with the next step.
   No: This ends the procedure.
4. Verify that the BMC network settings are correct.
   a. Power on the system by using the power button on the front of the system. Wait 1 - 2 minutes for the system to display the Petitboot menu.
   b. When the Petitboot menu is displayed, press any key to interrupt the boot process. Then, select Exit to Shell.
   c. Type the following command and press Enter:
      ipmitool lan print 1
   d. Verify that the MAC address and the IP address settings are correct. Then, continue with the next step.
      Note: If the IP address setting is incorrect, go to the Configuring the firmware IP address website (http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwenablenetwork.htm). If the MAC address is 00:00:00:00:00:00, go to “Contacting IBM service and support” on page 110.
5. Complete the following actions:
   a. Power on to the Petitboot menu.
   b. Use the BMC to update the system firmware. For instructions, see Updating the system firmware by using the BMC.
   Are you able to access the BMC?
   Yes: This ends the procedure.
   No: Continue with the next step.
6. Complete the service action that is indicated for your system:
   • If your system is an 8335-GCA or 8335-GTA, replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure. This ends the procedure.
   • If your system is an 8335-GTB, replace the BMC card. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure. This ends the procedure.
   • If your system is an 8348-21C, replace the system backplane. Go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure. This ends the procedure.
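The check in step 4 (compare the MAC address and IP address that ipmitool lan print 1 reports against the expected values) can be automated. The following Python sketch is one possible approach. It assumes that Python and ipmitool are both available where the command is run, which might not be true in the Petitboot shell, so the parser also accepts output that was captured elsewhere. It also assumes that the output uses the field labels IP Address and MAC Address. The function names, the sample text, and the 192.0.2.10 address are hypothetical.

```python
import subprocess

UNSET_MAC = "00:00:00:00:00:00"  # the value that step 4 treats as a failure


def parse_lan_print(text):
    """Parse 'Key : Value' lines into a dict.

    Continuation lines that have no key before the first colon are skipped,
    and the first occurrence of each key is kept.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key:
            fields.setdefault(key, value.strip())
    return fields


def check_bmc_lan(fields, expected_ip=None):
    """Return findings that mirror the note in step 4d (empty list means no finding)."""
    findings = []
    if fields.get("MAC Address", "").lower() == UNSET_MAC:
        findings.append("MAC address is all zeros: contact IBM service and support")
    ip = fields.get("IP Address", "")
    if expected_ip is not None and ip != expected_ip:
        findings.append(
            f"IP address is {ip!r}, expected {expected_ip!r}: "
            "reconfigure the firmware IP address"
        )
    return findings


def read_lan_settings(channel=1):
    """Run 'ipmitool lan print <channel>' locally and parse the result."""
    result = subprocess.run(
        ["ipmitool", "lan", "print", str(channel)],
        capture_output=True, text=True, check=True,
    )
    return parse_lan_print(result.stdout)


# Illustrative sample only; real output contains more fields.
SAMPLE = """\
IP Address Source       : Static Address
IP Address              : 192.0.2.10
Subnet Mask             : 255.255.255.0
MAC Address             : 00:00:00:00:00:00
"""

if __name__ == "__main__":
    for finding in check_bmc_lan(parse_lan_print(SAMPLE), expected_ip="192.0.2.10"):
        print(finding)
```

Splitting each line at the first colon keeps the colons inside the MAC address value intact, which is why the sketch uses partition rather than split.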

Resolving a power problem

Learn how to identify the service action that is needed to resolve a power problem.
1. Is the amber LED of a power supply on solid and is the amber LED on the front of the system turned off?
If Then Yes: Ensure that the power cords for both power supplies are fully seated and that the power
distribution units (PDUs) and power outlets are supplying electricity. This ends the procedure.
No: Continue with the next step.
Beginning troubleshooting and problem analysis 3
2. Are the power supply LEDs turned off?
If Then Yes: Continue with the next step. No: Continue with step 4.
3. Perform the following actions, one at a time, until the problem is resolved:
   a. Ensure that all of the power cords are fully seated in the power supplies.
   b. Ensure that all of the power cords are fully seated in the power distribution units (PDUs) or wall outlets.
   c. If the power cords are plugged into PDUs, ensure that the PDUs are turned on.
   d. Ensure that all of the power cords are plugged into PDUs or wall outlets that are supplying electricity.
   e. Replace the power cords.
   f. Replace the power supplies.
      • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
   This ends the procedure.
4. Is the amber LED of a power supply on solid and is the red LED on the front of the system flashing at 0.25 Hz?
   • Yes: Continue with the next step.
   • No: Go to “Contacting IBM service and support” on page 110. This ends the procedure.
5. Perform the following actions, one at a time, until the problem is resolved:
   a. Ensure that the power supply is fully seated in the system.
   b. Ensure that the power supply fan is not blocked.
   c. Replace the power supply.
      • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
   This ends the procedure.
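The branching in the power procedure can be summarized as a small decision function. The following Python sketch is illustrative only: the PowerLeds class, its field names, and the returned strings are assumptions made for this example and are not part of any IBM tool. It encodes the order in which steps 1, 2, and 4 are evaluated.

```python
from dataclasses import dataclass


@dataclass
class PowerLeds:
    """Observed LED states. Field names are illustrative, not from the manual."""
    psu_amber_solid: bool      # amber LED of a power supply is on solid
    psu_leds_off: bool         # power supply LEDs are turned off
    front_amber_on: bool       # amber LED on the front of the system is on
    front_red_flashing: bool   # red LED on the front flashes at 0.25 Hz


def power_problem_action(leds: PowerLeds) -> str:
    """Return the outcome that the power procedure reaches for these LEDs."""
    # Step 1: power supply amber LED solid and front amber LED off.
    if leds.psu_amber_solid and not leds.front_amber_on:
        return "check power cords, PDUs, and power outlets"
    # Step 2: power supply LEDs off leads to the action list in step 3.
    if leds.psu_leds_off:
        return "step 3: reseat cords, check PDUs, replace cords, replace power supplies"
    # Step 4: power supply amber LED solid and front red LED flashing leads to step 5.
    if leds.psu_amber_solid and leds.front_red_flashing:
        return "step 5: reseat power supply, check fan, replace power supply"
    # Step 4, answer No.
    return "contact IBM service and support"
```

The function returns only the first matching outcome, which mirrors the way the procedure stops at the first step whose question is answered with Yes.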

Resolving a system firmware boot failure

Learn how to identify the service action that is needed to resolve a failure while booting your system firmware.
1. After you pressed the power button, did the system turn on but fail to display the Petitboot menu?
   • Yes: Continue with the next step.
   • No: Continue with step 5.
2. Does the baseboard management controller (BMC) respond to commands?
   Note: To determine whether the BMC responds to commands, run the following ipmitool command:
      ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> chassis status
   • Yes: Continue with the next step.
   • No: Continue with step 4.
3. Complete the following actions:
   a. Use the BMC to update the system firmware. For instructions, see Updating the system firmware by using the BMC.
   b. Check the system event logs. For instructions, see “Identifying a service action by using system event logs” on page 27. Then, continue with step 5.
4. Complete the following actions, one at a time, until the problem is resolved:
   a. Reset the BMC remotely by entering the following command:
         ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> mc reset cold
   b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5 minutes, and then go to step 2.
   c. Use the IPMI tool to update the system firmware. For instructions, see Updating the system firmware by using the IPMI tool.
   d. Complete the service action that is indicated for your system:
      • If your system is an 8335-GCA or 8335-GTA, replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, replace the BMC card. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, replace the system backplane. Go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
   This ends the procedure.
5. Are you here because of a system event log (SEL) event with the value OEM record c0 and OEM c0 specific log information 3a1503xxxxxx?
   • Yes: Continue with step 8.
   • No: Continue with the next step.
6. Are you here because of a SEL event with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx?
   • Yes: Continue with step 12.
   • No: Continue with the next step.
7. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC power cords and power on the system. Does the system boot successfully?
   • Yes: This ends the procedure.
   • No: Go to “Resolving a hardware problem” on page 12. This ends the procedure.
8. Did the system complete the boot process successfully?
   • Yes: Continue with the next step.
   • No: Continue with step 12.
9. Determine whether the system is booted from the user-updated level of the system firmware image (primary side) or the manufacturing level of the system firmware image (golden side).
   • For in-band networks, enter the following command:
        ipmitool sensor list | grep -i golden
   • To run the command remotely over the LAN, enter the following command:
        ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sensor list | grep -i golden
   Do both of the returned records show 0x0080 in the data fields?
   • Yes: The error was temporary. No service action is required. This ends the procedure.
   • No: One or both of the returned records have 0x0180 in the data fields. The system was booted from the golden side. Continue with the next step.
10. Search for processor deconfiguration SEL events that have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here. Processor deconfiguration SEL events are displayed in the following form:
       Processor CPU Func x | Transition to Non-recoverable | Asserted
    Are processor deconfiguration events present?
    • Yes: Complete the service actions for the processor deconfiguration events.
       • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
       • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
       • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
    • No: Continue with the next step.
11. Are there other types of SEL events that require a service action and have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here?
    • Yes: Complete the service actions for the SEL events that require service actions.
       • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
       • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
       • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
    • No: If the boot problem persists, reload or update the system firmware image. Go to Getting fixes and reload the system firmware with the same level of firmware or update the system firmware with a more recent level of firmware. Then, reboot the system. This ends the procedure.
12. Search for processor deconfiguration SEL events that have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here. Processor deconfiguration SEL events are displayed in the following form:
       Processor CPU Func x | Transition to Non-recoverable | Asserted
    Are processor deconfiguration events present?
    • Yes: Complete the service actions for the processor deconfiguration events.
       • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
       • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
       • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
    • No: Continue with the next step.
13. Are there other types of SEL events that require a service action and have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here?
    • Yes: Complete the service actions for the SEL events that require service actions.
       • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
       • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
       • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
    • No: Continue with the next step.
14. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC power cords and power on the system. Does the system boot successfully?
    • Yes: This ends the procedure.
    • No: Continue with the next step.
15. Is the system an 8348-21C, and are all 32 of the DIMM locations populated with 32 GB DIMMs?
    • Yes: Continue with the next step.
    • No: Go to step 18.
16. Use the baseboard management controller (BMC) to update the system firmware. For instructions, see Updating the system firmware by using the BMC. Does the problem persist?
    • Yes: Continue with the next step.
    • No: This ends the procedure.
17. Is your system an 8335-GTB?
    • Yes: Replace the BMC card. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure. If the problem persists, continue with the next step. Otherwise, this ends the procedure.
    • No: Continue with the next step.
18. Replace the system backplane.
    • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure. Then, continue with the next step.
    • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure. Then, continue with the next step.
    • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure. Then, continue with the next step.
19. Does the problem persist?
    • Yes: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.
    • No: This ends the procedure.
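The check in step 9 can be automated once the output of ipmitool sensor list has been captured. The following Python sketch is a minimal illustration under stated assumptions: the function name is hypothetical, the output is assumed to use "|" as the column separator, and the two sample records (sensor names and column layout) are made up for the example. Only the values 0x0080 and 0x0180 come from the procedure.

```python
# Data-field values named in step 9 of the procedure.
PRIMARY_SIDE = "0x0080"
GOLDEN_SIDE = "0x0180"


def booted_from_golden(sensor_lines):
    """Return True if any golden-side sensor record reports 0x0180.

    sensor_lines: lines of `ipmitool sensor list` output. Lines that do not
    mention "golden" are ignored, so unfiltered output is acceptable.
    Raises ValueError if no golden-side record carries 0x0080 or 0x0180.
    """
    values = []
    for line in sensor_lines:
        if "golden" not in line.lower():
            continue
        # Assumption: columns are separated by "|" characters.
        fields = [field.strip().lower() for field in line.split("|")]
        values.extend(f for f in fields if f in (PRIMARY_SIDE, GOLDEN_SIDE))
    if not values:
        raise ValueError("no golden-side record with 0x0080 or 0x0180 found")
    return GOLDEN_SIDE in values


# Made-up records for illustration; real sensor names and columns can differ.
SAMPLE_RECORDS = [
    "BIOS Golden Side | 0x0 | discrete | 0x0080| na | na | na | na | na | na",
    "BMC Golden Side  | 0x0 | discrete | 0x0180| na | na | na | na | na | na",
]
```

A result of True corresponds to the No branch of step 9 (the system was booted from the golden side). A result of False corresponds to the Yes branch (the error was temporary).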

Resolving a VGA monitor problem

Learn how to identify the service action that is needed to resolve a video graphics array (VGA) monitor problem.
1. Is the system powered on and is the VGA monitor connected to the VGA display port, but video is not displayed?
   • Yes: Continue with the next step.
   • No: This ends the procedure.
2. Complete the following steps, one at a time, until the problem is resolved:
   a. Ensure that the VGA cable is properly seated to the server port and to the monitor port.
   b. Verify that the monitor and the VGA cable are working properly by testing them on a system that is known to be working properly. If the monitor or the VGA cable does not work properly, replace it.
   c. Verify that the system is powered on by activating a serial over LAN (SOL) session through the baseboard management controller (BMC). If the system is not active, go to “Resolving a system firmware boot failure” on page 4.
   d. Replace the system backplane.
      • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
   This ends the procedure.

Resolving an operating system boot failure

Learn how to identify the service action that is needed to resolve a failure while booting your operating system.
1. Was the system recently installed, serviced, moved, or upgraded?
   • Yes: Ensure that all cables are properly seated in the connection path to the designated boot device. This ends the procedure.
   • No: Continue with the next step.
2. Are you booting the operating system from a network location?
   • Yes: Continue with the next step.
   • No: Continue with step 4.
3. Complete the following actions, one at a time, until the problem is resolved:
   a. Ensure that a problem does not exist with the connection to the network location.
   b. Ensure that the adapter has a valid IP address for the network.
   c. Replace the network adapter.
      • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
4. By default, Petitboot displays all of the bootable images that it recognizes. Is the boot image recognized by Petitboot?
   • Yes: Continue with step 11.
   • No: Select the Petitboot menu option to refresh the boot images. If the problem persists, continue with the next step.
5. Is the system an 8348-21C, and is the boot image on a storage device that is configured in a RAID configuration?
   • Yes: Continue with the next step.
   • No: Continue with step 11.
6. On the Petitboot command line, type the following command:
      arcconf getconfig 1 LD
   Is the logical boot drive recognized and in optimal status?
   • Yes: Reinstall the operating system on the logical drive. This ends the procedure.
   • No: Continue with the next step.
7. Are the drives properly seated in their respective drive bays?
   Note:
   • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
   • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
   • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
   Answer:
   • Yes: Continue with the next step.
   • No: Properly seat the drives in the drive bays. Then, go to step 4.
8. Refresh the Petitboot boot options. Is the boot image on the logical drive recognized?
   • Yes: Boot the operating system. Then, continue with step 11.
   • No: Continue with the next step.
9. Verify that the physical drives are in the RAID array. On the Petitboot command line, type the following command:
      arcconf getconfig 1 PD
   Are the physical drives that are known to be in the RAID array recognized?
   • Yes: Reinstall the operating system on the logical drive. This ends the procedure.
   • No: Continue with the next step.
10. Complete the following actions, one at a time, until the physical drives are recognized in the RAID array:
    Note:
    • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
    • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
    • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
    a. Ensure that the SAS cable is securely seated in the RAID adapter and the storage backplane.
    b. Replace the RAID adapter.
    c. Replace the SAS cable.
    This ends the procedure.
11. Does an operating system error occur during the boot?
    • Yes: Recover the operating system with the tools provided for the operating system. If that does not resolve the problem, reinstall the operating system. This ends the procedure.
    • No: Reinstall the operating system. This ends the procedure.
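The status check in step 6 can be sketched as a small parser for captured arcconf getconfig 1 LD output. This Python example is illustrative only. It assumes that the output reports each logical device state on a line labeled "Status of Logical Device", followed by a colon and the state. Verify that label against the output of your arcconf version. The function names and the sample text are hypothetical.

```python
def logical_device_states(arcconf_output):
    """Collect the states found on 'Status of Logical Device' lines."""
    states = []
    for line in arcconf_output.splitlines():
        key, sep, value = line.partition(":")
        # Assumption: the state is reported as "Status of Logical Device : <state>".
        if sep and key.strip().lower() == "status of logical device":
            states.append(value.strip())
    return states


def logical_drives_optimal(arcconf_output):
    """Return True only if at least one logical device is listed and all are Optimal."""
    states = logical_device_states(arcconf_output)
    return bool(states) and all(state.lower() == "optimal" for state in states)


# Made-up output for illustration; real output contains many more fields.
SAMPLE_LD_OUTPUT = """\
Logical Device number 0
   Logical Device name                      : boot
   Status of Logical Device                 : Optimal
"""
```

This sketch checks every logical device rather than only the boot drive, which is a simplification. An empty result from logical_device_states corresponds to a logical drive that is not recognized.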

Resolving a sensor indicator problem

Learn how to resolve a sensor indicator problem by using the BMC dashboard.
After the system is powered on, some sensors retain their status from the last time the system was operational. As a result, the sensor indicator LED might not reflect the status of the physical sensor, and it can be unclear whether the sensor indicator LED indicates an actual problem that requires a service action. For more information about BMC dashboard sensors on an 8335-GCA or 8335-GTA, see Event sensor status GUI display. For more information about BMC dashboard sensors on an 8335-GTB, see Event sensor status GUI display. For more information about BMC dashboard sensors on an 8348-21C, see Event sensor status GUI display.
To refresh the sensor indicator LEDs and to determine whether a service action is required, complete the following procedure:
1. Power off the system. Then, boot the system to the operational state. Click Refresh on the BMC dashboard. Are any of the sensor indicator LEDs still red?
   • Yes: Continue with the next step.
   • No: This ends the procedure.
2. Record the names of any sensors that have a red LED indicator status.
   Note: Repeat steps 3 - 6 for every sensor that you record in this step.
3. Use one of the following commands to list the system event logs (SELs).
   • To list SELs by using an in-band network, enter the following command:
        ipmitool sel elist
   • To list SELs remotely over the LAN, enter the following command:
        ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
4. Review the list of SELs and locate the log entry that meets all of the following criteria:
   • The entry contains the name of one of the sensors that you recorded in step 2.
   • A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
   • Asserted is in the description.
   Did you identify a log entry that meets these criteria?
   • Yes: Continue with the next step.
   • No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.
5. Use one of the following options to display the SEL details for the sensor:
   Note: You must specify the SEL record ID in hexadecimal format. For example: 0x1a.
   • To display SEL details by using an in-band network, enter the following command:
        ipmitool sel get <SEL record ID>
   • To display SEL details remotely over the LAN, enter the following command:
        ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
6. The sensor ID field contains sensor information in the sensor name (sensor ID) format. Record the sensor name, sensor ID, and event description. Then, use this information to determine the service action to perform:
   • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37 to determine the service action to perform. This ends the procedure.
   • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57 to determine the service action to perform. This ends the procedure.
   • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78 to determine the service action to perform. This ends the procedure.
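Steps 4 and 5 can be sketched as a filter over captured ipmitool sel elist output. The following Python example is a minimal illustration, not an IBM utility. It assumes that each SEL line is separated by "|" characters and that the first column is the record ID printed in hexadecimal without the 0x prefix. The function name, the sample lines, and the keyword used with them are hypothetical. The actual service action keywords are listed in “Identifying service action keywords in system event logs”.

```python
def matching_sel_records(sel_lines, sensor_names, keywords):
    """Return record IDs, formatted like 0x1a, of entries that meet the step 4 criteria."""
    matches = []
    for line in sel_lines:
        fields = [field.strip() for field in line.split("|")]
        # Assumption: the first column is the record ID in hexadecimal.
        try:
            int(fields[0], 16)
        except ValueError:
            continue
        text = line.lower()
        has_sensor = any(name.lower() in text for name in sensor_names)
        has_keyword = any(word.lower() in text for word in keywords)
        # Compare whole fields so that "Deasserted" is not mistaken for "Asserted".
        is_asserted = any(field.lower() == "asserted" for field in fields)
        if has_sensor and has_keyword and is_asserted:
            matches.append("0x" + fields[0].lower())
    return matches


# Made-up SEL lines for illustration; dates, times, and IDs are not real data.
SAMPLE_SEL = [
    "  19 | 04/10/2019 | 12:00:01 | Processor CPU Func 1 | Transition to Non-recoverable | Deasserted",
    "  1a | 04/10/2019 | 12:00:05 | Processor CPU Func 1 | Transition to Non-recoverable | Asserted",
]
```

Each returned value can be passed to ipmitool sel get as the SEL record ID, which satisfies the hexadecimal format that step 5 requires.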

Resolving a hardware problem

Learn how to identify the service action that is needed to resolve a hardware problem.
1. If you have not already done so, manually boot the system.
2. Go to “Identifying a service action by using system event logs” on page 27. Then, continue with the next step.
3. Was a service action identified?
   • Yes: Continue with the next step.
   • No: Go to step 5.
4. Did the service action fix the problem?
   • Yes: This ends the procedure.
   • No: Go to step 5.
5. Go to “Resolving a GPU, PCIe adapter, or device problem” on page 13. Then, continue with the next step.
6. Was a service action identified?
   • Yes: Continue with the next step.
   • No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.
7. Did the service action fix the problem?
   • Yes: This ends the procedure.
   • No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.

Resolving a GPU, PCIe adapter, or device problem

Learn how to access log files, information to identify types of events, and a list of potential problems and service actions.
1. Are all of the adapters in the system missing or failed?
   • Yes: Replace the system backplane.
      • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
      • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
   • No: Continue with the next step.
2. To identify the correct service procedure to perform by using operating system log information, complete the following steps:
   a. Log in as the root user.
   b. At the command prompt, type dmesg and press Enter.
3. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed. When you find a keyword that accompanies one or more of the resource names in the following table, a service action is required. Use the following table to determine the service procedure to perform for your type of problem.
Table 1. Resource names, examples, and service procedures for different types of operating system logs.

Resource name | Example of a log requiring a service action | Type of problem | Service procedure
aacraid | PCI error detected 2 | RAID (Note: This adapter is available only for 8348-21C systems.) | Go to “Resolving a RAID adapter problem” on page 14.
eth1, eth2, eth3 | Failed to re-initialize device | Network | Go to “Resolving a network adapter problem” on page 15.
NVRM | aborting RmInitAdapter failed! | Graphics | Go to “Resolving a graphics processing unit problem” on page 16.
nvidia-nvlink | IBMNPU: NPU FENCE detected, machine power cycle required | Graphics | Go to “Resolving a graphics processing unit problem” on page 16.
nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter (Note: This adapter is available only for 8335-GCA systems.) | Go to “Resolving an NVMe Flash adapter problem” on page 19.
ata1, ata2 | SError: { RecovComm PHYRdyChg 10B8B Dispar } | Marvell storage adapter (Note: This adapter is available only for 8348-21C systems.) | Go to “Resolving a storage device problem” on page 20.
sda, sdb, sdc | FAILED Result | Storage | Go to “Resolving a storage device problem” on page 20.
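The scan that step 3 describes can be sketched as a lookup over captured dmesg output. The following Python example is illustrative only. The regular expressions are one possible way to recognize the resource names in Table 1, the function name is hypothetical, and the sample lines are made up. The keyword pattern covers only fail, failure, and failed, although the procedure says "such as", so other keywords might also apply.

```python
import re

# Resource-name patterns paired with the service procedures from Table 1.
# The patterns are an assumption about how the names appear in dmesg lines.
RESOURCE_PROCEDURES = [
    (re.compile(r"\baacraid\b"), "Resolving a RAID adapter problem"),
    (re.compile(r"\beth\d+\b"), "Resolving a network adapter problem"),
    (re.compile(r"\bNVRM\b"), "Resolving a graphics processing unit problem"),
    (re.compile(r"\bnvidia-nvlink\b"), "Resolving a graphics processing unit problem"),
    (re.compile(r"\bnvme"), "Resolving an NVMe Flash adapter problem"),
    (re.compile(r"\bata\d+\b"), "Resolving a storage device problem"),
    (re.compile(r"\bsd[a-z]+\b"), "Resolving a storage device problem"),
]

# Matches fail, failure, and failed in any letter case.
KEYWORD = re.compile(r"fail", re.IGNORECASE)


def first_service_procedure(dmesg_lines):
    """Return (procedure, line) for the first keyword line that names a resource.

    Returns None when no line contains both a keyword and a resource name.
    """
    for line in dmesg_lines:
        if not KEYWORD.search(line):
            continue
        for pattern, procedure in RESOURCE_PROCEDURES:
            if pattern.search(line):
                return procedure, line
    return None


# Made-up log lines for illustration; real time stamps and prefixes differ.
SAMPLE_DMESG = [
    "[    1.000000] usb 1-1: new high-speed USB device",
    "[   12.000000] NVRM: RmInitAdapter failed!",
    "[   13.000000] eth1: Failed to re-initialize device",
]
```

Because the procedure asks for the first occurrence, the function stops at the first matching line instead of collecting every match. For the sample lines, it selects the NVRM entry and not the later eth1 entry.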

Resolving a RAID adapter problem

Learn about the possible problems and service actions that you can perform to resolve a RAID adapter problem.
Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by using the slot number” on page 21.
Table 2. RAID adapter problems and service actions.

Problem: System unable to find adapter
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.
5. Restart the system.
6. Replace the adapter.
7. Replace the system backplane.
8. Replace the central processing unit (CPU).

Problem: Adapter stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter is seated properly and all associated cables are connected correctly.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently added one or more new adapters, remove them and then test to determine whether the failing adapter is functioning properly again. If the RAID adapter is functioning again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. Replace the system backplane.
7. Replace the CPU.

Problem: One or more drives are not recognized
Service action:
1. If more than one drive is not recognized, verify that the cables are properly attached to the RAID card.
2. Verify that the drive or drives are fully seated in the system.
3. Verify that all of the cables that attach to the backplane are properly seated.
4. Verify that the drive or drives are compatible with the RAID adapter.
5. Verify that the most recent firmware is installed for the RAID adapter. If it is not, install the most recent firmware.
6. If more than one drive is not recognized, replace the drive.
7. Replace the RAID adapter.
8. Replace the system backplane.
9. Replace the cable or cables.

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For information about adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.

Resolving a network adapter problem

Learn about the possible problems and service actions that you can perform to resolve a network adapter problem.
Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by using the slot number” on page 21.
Table 3. Network adapter problems and service actions.

Problem: System unable to find adapter
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.
5. Restart the system.
6. Replace the adapter.
7. Replace the system backplane.
8. Replace the central processing unit (CPU).

Problem: Adapter stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter is seated properly and all associated cables are correctly connected.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently added one or more new adapters, remove them and then test to determine whether the failing adapter is functioning properly again. If the network adapter is functioning again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. Replace the system backplane.
7. Replace the CPU.

Problem: Link indicator light on the adapter is off
Service action:
1. Verify that the cable functions properly by testing it with a known working connection.
2. Verify that the port or ports on the switch are enabled and functional.
3. Verify that the switch and adapter are compatible.
4. Replace the adapter.

Problem: Link light on the adapter is on, but there is no communication from the adapter
Service action:
1. Verify that the most recent driver is installed. If it is not, install the most recent driver.
2. Verify that the adapter and its link have compatible settings, such as speed and duplex configuration.

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For information about adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.

Resolving a graphics processing unit problem

Learn about the possible problems and service actions that you can perform to resolve a graphics processing unit (GPU) problem.
Note: To determine the location of the GPU, see “Identifying the location of the GPU” on page 22.
Table 4. GPU problems and service actions for the 8335-GCA or 8335-GTA

Problem: System unable to find GPU
Service action:
1. Verify that the GPU is properly seated in a compatible slot.
2. Install the GPU in a different compatible slot.
3. Verify that the drivers for the GPU are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.
5. Restart the system.
6. If the GPU is still missing, replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
   a. GPU
   b. System processor modules
   c. System backplane

Problem: GPU stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the GPU is seated properly and all associated cables are connected correctly.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently added one or more new adapters, remove them and then test to determine whether the failing adapter is functioning properly again. If the graphics adapter is functioning again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters one at a time until all adapters function properly.
5. If the GPU is still not working, replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
   a. GPU
   b. System processor modules
   c. System backplane

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For information about adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.
Table 5. GPU problems and service actions for the 8335-GTB

Problem: System unable to find GPU
Service action:
1. Verify that the GPU is properly seated.
2. Verify that the drivers for the GPU are installed.
3. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.
4. Restart the system.
5. If the GPU is still missing, replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
   a. GPU
   b. System processor modules
   c. System backplane

Problem: Fence errors in the operating system log
Service action:
1. Restart the system. Do fence errors continue to be logged in the operating system log?
   • Yes: Continue with the next step.
   • No: This ends the procedure.
2. Does NPU chip 0 appear in the fence error log entry?
   • Yes: Continue with the next step.
   • No: Go to step 4.
3. Replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
   a. CPU 1
   b. GPU 2
   c. GPU 1
   d. System backplane
   This ends the procedure.
4. Does NPU chip 1 appear in the fence error log entry?
   • Yes: Continue with the next step.
   • No: Go to “Contacting IBM service and support” on page 110. This ends the procedure.
5. Replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
   a. CPU 2
   b. GPU 4
   c. GPU 3
   d. System backplane
   This ends the procedure.

Problem: GPU stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the GPU is seated properly.
2. Inspect the GPU and verify that it is not physically damaged.
3. If the GPU is still not working, replace the following items, one at a time, until the problem is resolved:
   Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
   a. GPU
   b. System processor modules
   c. System backplane

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.

Resolving an NVMe Flash adapter problem

Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile Memory Express (NVMe) Flash adapter problem.
If you suspect a problem with a PCIe3 1.92 TB CAPI NVMe Flash accelerator adapter (FC EJ1K; CCIN 58CD), see PCIe3 1.92 TB CAPI NVMe Flash Accelerator Adapter (FC EJ1K; CCIN 58CD).
If you suspect a problem with an NVMe Flash adapter, use the following table to determine the service action to perform.
Note: To determine the location of the NVMe Flash adapter, see “Identifying the location of the NVMe Flash adapter” on page 23.
Table 6. NVMe Flash adapter problems and service actions

Problem: System is unable to find the NVMe Flash adapter
Service action:
1. If the NVMe Flash adapter has an amber LED that is flashing or is on solid, replace the adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
   Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.
2. If the system was recently installed, moved, serviced, or upgraded, verify that the NVMe Flash adapter is seated and installed properly.
3. Verify that the NVMe Flash adapter is compatible with the system.
4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.
5. Replace the NVMe Flash adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
   Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.

Problem: NVMe Flash adapter stops working suddenly
Service action:
1. If the NVMe Flash adapter has an amber LED that is flashing or is on solid, replace the adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
   Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.
2. Check the system logs to verify whether the system detected a problem.
3. Replace the NVMe Flash adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
   Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.

Problem: Maximum write capability of an NVMe Flash adapter is depleted
Service action: To determine whether the maximum write capability of a PCIe3 1.6 TB NVMe Flash adapter is depleted, see PCIe3 1.6 TB NVMe Flash adapter (FC EC54; CCIN 58CB). To determine whether the maximum write capability of a PCIe3 3.2 TB NVMe Flash adapter is depleted, see PCIe3 3.2 TB NVMe Flash adapter (FC EC56; CCIN 58CC). If you determine that the adapter must be replaced, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
   Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.

Problem: Other problems
Service action:
1. Check for and resolve any nvmeX entries in the operating system log, where nvmeX is the resource name of the NVMe Flash adapter. Then, test the NVMe Flash adapter again.
2. Ensure that the latest I/O adapter firmware is installed. For instructions, see Getting firmware fixes for IBM I/O adapters by using Fix Central.
3. Ensure that you have the latest device driver service updates by installing the latest Linux distribution fixes.
4. Type the following command and press Enter:
   nvme smart-log /dev/nvmeX
   where nvmeX is the resource name of the NVMe Flash adapter.
   Check for problems with the critical warning, temperature, available spare, percentage used, power cycles, or power on hours fields.
   Note: For more information about nvme commands, type man nvme and press Enter.
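
The check in step 4 of the "Other problems" row can be sketched in Python. This is only an illustration, not an IBM tool: it assumes that the text output of nvme smart-log consists of "name : value" lines with field names such as critical_warning, available_spare, available_spare_threshold, and percentage_used, which can differ between nvme-cli versions. The SAMPLE text, the function names, and the three conditions that are flagged (nonzero critical warning, spare below its threshold, percentage used at or above 100) are assumptions based on the general meaning of these NVMe health fields, not limits stated in this manual.

```python
# Hypothetical sample text; real nvme smart-log output may be laid out differently.
SAMPLE = """\
critical_warning : 0
temperature : 38 C
available_spare : 4%
available_spare_threshold : 10%
percentage_used : 101%
power_cycles : 12
power_on_hours : 1,234
"""


def parse_smart_log(text):
    """Collect 'name : value' lines into a dict keyed by lowercase field name."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return fields


def to_int(value):
    """Return the leading number of a value such as '4%', '38 C', or '1,234'."""
    parts = value.split()
    if not parts:
        return None
    token = parts[0].rstrip("%").replace(",", "")
    try:
        return int(token, 0)
    except ValueError:
        return None


def findings(fields):
    """List the health fields that look suspicious (assumed conditions)."""
    result = []
    warning = to_int(fields.get("critical_warning", ""))
    if warning:
        result.append("critical_warning is nonzero")
    spare = to_int(fields.get("available_spare", ""))
    threshold = to_int(fields.get("available_spare_threshold", ""))
    if spare is not None and threshold is not None and spare < threshold:
        result.append("available_spare is below its threshold")
    used = to_int(fields.get("percentage_used", ""))
    if used is not None and used >= 100:
        result.append("percentage_used has reached 100")
    return result


for finding in findings(parse_smart_log(SAMPLE)):
    print(finding)
```

A field that is flagged here is only a prompt to look more closely; the service actions in Table 6 still decide whether the adapter is replaced.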

Resolving a storage device problem

Learn about the possible problems and service actions that you can perform to resolve a storage device problem.
Note: To determine the location of the storage device, see “Identifying the location of the storage device” on page 24.
Table 7. Storage device problems and service actions

Problem: System is unable to find a storage device that is at the front of the system
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the device is seated and installed properly.
2. Verify that the device is compatible with your system.
3. Verify that all internal cables are properly seated and are not physically damaged.
4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.
5. Replace the drive.
6. If your system is an 8348-21C, replace the system backplane or the storage mezzanine card.
7. Replace the cable.
8. If a RAID adapter is installed, replace it.

Problem: System is unable to find a storage device that is at the rear of the system (8348-21C only)
Service action:
If the system is unable to find one storage device that is at the rear of the system, replace the following items, one at a time, until the problem is resolved:
• Drive
• Drive tray
• System backplane
If the system is unable to find more than one storage device that is at the rear of the system, replace the following items, one at a time, until the problem is resolved:
• Drive tray
• System backplane

Problem: Drive stops working suddenly
Service action:
1. Verify that all internal cables are properly seated and are not physically damaged.
2. Check the system logs to verify whether the system detected a problem.
3. Replace the drive.
4. If your system is an 8348-21C, replace the system backplane or the storage mezzanine card.
5. Replace the cable.
6. If a RAID adapter is installed, replace it.

Problem: Other problems
Service action: Check the messages and resolve any other problems that were detected. Then, test the drive again. If the drive still does not function, refer to the documentation for the drive.

Identifying the location of the PCIe adapter by using the slot number

The error message provides information to help you to determine the location of the PCIe adapter.
For example, the log might contain an error message similar to the following text:
[131779.752714] EEH: PHB#0 failure detected, location: Slot5
Use the following table to map the slot number information in the operating system log to the PCIe adapter description and service action.
Table 8. Slot numbers, adapter descriptions, and service action for the 8335-GCA or 8335-GTA

Slot information from the log    PCIe adapter description
Slot1                            PCIe adapter 1
Slot2                            PCIe adapter 2
Slot3                            PCIe adapter 3
Slot4                            PCIe adapter 4
Slot5                            PCIe adapter 5

Service action (all slots): Replace the PCIe adapter indicated in the PCIe adapter description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

Table 9. Slot numbers, adapter descriptions, and service action for the 8335-GTB

Slot information from the log    PCIe adapter description
Slot1                            PCIe adapter 1
Slot2                            PCIe adapter 2
Slot3                            PCIe adapter 3

Service action (all slots): Replace the PCIe adapter indicated in the PCIe adapter description column. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.

Table 10. Slot numbers, adapter descriptions, and service action for the 8348-21C

Slot information from the log    PCIe adapter description
Slot1                            PCIe adapter 1
Slot2                            PCIe adapter 2
Slot3                            PCIe adapter 3
Slot4                            PCIe adapter 4

Service action (all slots): Replace the PCIe adapter indicated in the PCIe adapter description column. Go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
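
The slot lookup described in this topic can be sketched in Python. This is a minimal illustration under stated assumptions: the log line has the same shape as the example message in this topic, and the per-model slot counts are the ones listed in Tables 8 to 10. The names SLOTS_BY_MODEL, EEH_SLOT, and adapter_from_log are hypothetical and are not part of any IBM or Linux tool.

```python
import re

# Highest slot number listed for each model in Tables 8 to 10.
SLOTS_BY_MODEL = {
    "8335-GCA": 5,
    "8335-GTA": 5,
    "8335-GTB": 3,
    "8348-21C": 4,
}

# Assumed message shape, taken from the example log line in this topic.
EEH_SLOT = re.compile(
    r"EEH: PHB#[0-9a-fA-F]+ failure detected, location: Slot(\d+)"
)


def adapter_from_log(line, model):
    """Return the PCIe adapter description for an EEH log line, or None."""
    match = EEH_SLOT.search(line)
    if match is None:
        return None
    slot = int(match.group(1))
    if not 1 <= slot <= SLOTS_BY_MODEL.get(model, 0):
        return None  # slot is not listed in the table for this model
    return f"PCIe adapter {slot}"


line = "[131779.752714] EEH: PHB#0 failure detected, location: Slot5"
print(adapter_from_log(line, "8335-GCA"))
```

Returning None for a slot that the model's table does not list keeps the sketch from suggesting a replacement that the manual does not document.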

Identifying the location of the GPU

The error message provides information to help you to determine the location of the graphics processing unit (GPU).
On an 8335-GCA or 8335-GTA system, the log might contain an error message similar to the following text:
EEH: PHB#0 failure detected, location: Slot5
On an 8335-GTB system, the log might contain an error message similar to the following text:
EEH: PHB#0 failure detected, location: GPU1
If you have an 8335-GTB system with Red Hat Enterprise Linux 7.4 or later, and if you get an error message with only PCI bus information (for example, 0002:01:00.0), you can determine the GPU slot information by using the lshw command. Complete the following steps:
1. Record the PCI bus information that is in the error message.
2. Log in to the operating system with root authority.
3. Type the following command and press Enter:
lshw -class display
4. Determine the GPU slot that is associated with the PCI bus information that you recorded in step 1.
Use the following table to map the slot or GPU number information in the operating system log to the GPU description and service action. This ends the procedure.
Table 11. Slot numbers, GPU descriptions, and service action for the 8335-GCA or 8335-GTA

Slot number information from the log    GPU description
Slot5                                   GPU 2
Slot2                                   GPU 1

Service action (all slots): Replace the GPU indicated in the GPU description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

Table 12. GPU numbers, GPU descriptions, and service action for the 8335-GTB

GPU number information from the log    GPU description
GPU1                                   GPU 1
GPU2                                   GPU 2
GPU3                                   GPU 3
GPU4                                   GPU 4

Service action (all GPUs): Replace the GPU indicated in the GPU description column. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
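
Because the location token differs by model (a slot number on the 8335-GCA and 8335-GTA, a GPU number on the 8335-GTB), the lookup in Tables 11 and 12 can be sketched as one dictionary per model. This Python sketch assumes that the location token follows "location:" as in the example messages; the names GPU_BY_LOCATION and gpu_from_log are hypothetical.

```python
import re

# Location token to GPU description, copied from Tables 11 and 12.
_GCA_GTA = {"Slot5": "GPU 2", "Slot2": "GPU 1"}
GPU_BY_LOCATION = {
    "8335-GCA": _GCA_GTA,
    "8335-GTA": _GCA_GTA,
    "8335-GTB": {"GPU1": "GPU 1", "GPU2": "GPU 2",
                 "GPU3": "GPU 3", "GPU4": "GPU 4"},
}

# Assumed message shape, taken from the example log lines in this topic.
LOCATION = re.compile(r"failure detected, location: (\S+)")


def gpu_from_log(line, model):
    """Return the GPU description for an EEH log line, or None if unlisted."""
    match = LOCATION.search(line)
    if match is None:
        return None
    return GPU_BY_LOCATION.get(model, {}).get(match.group(1))


print(gpu_from_log("EEH: PHB#0 failure detected, location: Slot5", "8335-GTA"))
print(gpu_from_log("EEH: PHB#0 failure detected, location: GPU1", "8335-GTB"))
```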

Identifying the location of the NVMe Flash adapter

Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.
1. Does the operating system log contain the slot number? For example, the log might contain an error message similar to the following text:
   [131779.752714] EEH: PHB#0 failure detected, location: Slot1
   Yes: If your system is an 8335-GCA, use Table 13 on page 24 to map the slot number information in the operating system log to the PCIe adapter description and service action. If your system is an 8335-GTB, use Table 14 on page 24 to map the slot number information in the operating system log to the PCIe adapter description and service action. This ends the procedure.
   No: Continue with the next step.
2. Locate the NVMe Flash adapter by using the PCI address:
   a. The operating system log contains information about the NVMe Flash adapter in the form of a PCI address. Record the PCI address information for the NVMe Flash adapter that has failed. For example, in the operating system log message nvme 0006:01:00.0: Failed status: ffffffff, reset controller, the PCI address of the failing NVMe Flash adapter is 0006:01:00.0.
   b. At the command line, type lscfg -vl pciaddress, where pciaddress is the NVMe Flash adapter information that you recorded in step 2.a. Then, press Enter.
   c. Record the slot number information that is in the location code field.
   d. If your system is an 8335-GCA, use Table 13 on page 24 to map the slot number information to the PCIe adapter description and service action. If your system is an 8335-GTB, use Table 14 on page 24 to map the slot number information to the PCIe adapter description and service action. This ends the procedure.
Table 13. Slot numbers, adapter descriptions, and service action for the 8335-GCA

Slot information from the log    PCIe adapter description
Slot1                            PCIe adapter 1
Slot3                            PCIe adapter 3
Slot4                            PCIe adapter 4

Service action (all slots): Replace the NVMe Flash adapter indicated in the PCIe adapter description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

Table 14. Slot numbers, adapter descriptions, and service action for the 8335-GTB

Slot information from the log    PCIe adapter description
Slot1                            PCIe adapter 1
Slot2                            PCIe adapter 2
Slot3                            PCIe adapter 3

Service action (all slots): Replace the NVMe Flash adapter indicated in the PCIe adapter description column. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
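
Step 2.a of the procedure above asks you to record the PCI address from the operating system log. The following Python sketch shows one way to pull that address out of a log line. It assumes that the address has the domain:bus:device.function form shown in the example message; the names PCI_ADDRESS and nvme_pci_address are hypothetical.

```python
import re

# Assumed PCI address form, for example 0006:01:00.0
# (4 hex digits, 2 hex digits, 2 hex digits, one function digit 0 to 7).
PCI_ADDRESS = re.compile(
    r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)


def nvme_pci_address(line):
    """Return the PCI address in an nvme log message, or None if absent."""
    if "nvme" not in line:
        return None
    match = PCI_ADDRESS.search(line)
    return match.group(0) if match else None


message = "nvme 0006:01:00.0: Failed status: ffffffff, reset controller"
print(nvme_pci_address(message))
```

The recorded address is then the value that step 2.b passes to the lscfg command.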

Identifying the location of the storage device

Use this procedure to identify the location of a storage device.
1. Is there a disk drive or solid-state drive with an amber fault LED turned on solid?
   Yes: Continue with step 2.
   No: Continue with step 3.
2. Replace the disk drive or solid-state drive.
   • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the removal and replacement procedure. This ends the procedure.
   • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the removal and replacement procedure. This ends the procedure.
   • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the removal and replacement procedure. This ends the procedure.
3. Is the system an 8335-GCA, 8335-GTA, or 8335-GTB?
   Yes: Continue with step 4.
   No: Continue with step 5.
4. The storage device location is determined in the drive removal and replacement procedures for your system. Use the following table to find the correct removal and replacement procedure. This ends the procedure.

   Table 15. Drive removal and replacement procedures

   System                  Drive removal and replacement procedures
   8335-GCA or 8335-GTA    See Removing and replacing a disk drive in the 8335-GCA or 8335-GTA with the system power turned on.
   8335-GTB                See Removing and replacing a disk drive in the 8335-GTB.

5. The system is an 8348-21C. Are the devices controlled by a RAID adapter?
   Yes: Continue with step 6.
   No: Continue with step 9.
6. To locate the device by using the identify LED, complete the following steps:
   a. The operating system log contains information about the device in the form sdx, where x is the letter associated with the drive that failed. Record the sdx information for the device that failed. For example, the failing device in the following operating system log is sdb:
      [ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072
   b. At the command prompt, type hdparm -i /dev/sdx, where sdx is the device information recorded in step 6a. Then, press Enter.
   c. Record the serial number of the device.
   d. At the command prompt, type arcconf getconfig 1 PD and press Enter. Find the reported channel and device numbers for the device that has the same serial number that you recorded in the previous step. Record the reported channel and device numbers.
   e. At the command prompt, type arcconf identify 1 device x y start, where x is the reported channel number and y is the reported device number that you recorded in the previous step. Then, press Enter.
   Is the identify LED for one of the devices flashing?
   Yes: Continue with the next step.
   No: Continue with step 9.
7. Replace the device with the flashing identify LED. Go to “8348-21C locations” on page 133 to identify the removal and replacement procedure. After you have replaced the device, continue with the next step.
8. At the command prompt, type arcconf identify 1 device x y stop, where x is the reported channel number and y is the reported device number that you recorded in step 6d. Then, press Enter. This ends the procedure.
9. To locate the device by using the device serial number, complete the following steps:
   a. The operating system log contains information about the device in the form sdx, where x is the letter associated with the drive that failed. Record the sdx information for the device that failed. For example, the failing device in the following operating system log is sdb:
      [ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072
   b. At the command prompt, type hdparm -i /dev/sdx, where sdx is the device information recorded in step 9a. Then, press Enter.
   c. Record the serial number of the device.
   d. Power off the system. Remove one device at a time until you identify the device with the serial number identified in step 9c. Replace only the device with the matching serial number. Reinstall the other devices. Go to “8348-21C locations” on page 133 to identify the removal and replacement procedure. This ends the procedure.
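
The first substeps of steps 6 and 9 (find the sdx name in the log, then form the commands) can be sketched in Python. The sketch only builds the argument lists for the commands that are named in the procedure and does not run them. It assumes that the log line has the same shape as the example message; the names BLK_ERROR, failing_device, hdparm_command, and identify_command are hypothetical.

```python
import re

# Assumed message shape, taken from the example log line in the procedure.
BLK_ERROR = re.compile(r"I/O error, dev (sd[a-z]+), sector \d+")


def failing_device(line):
    """Return the sdx device name from a block I/O error line, or None."""
    match = BLK_ERROR.search(line)
    return match.group(1) if match else None


def hdparm_command(device):
    """Argument list for the hdparm command in steps 6b and 9b."""
    return ["hdparm", "-i", f"/dev/{device}"]


def identify_command(channel, device_number, action):
    """Argument list for the arcconf identify command in steps 6e and 8."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["arcconf", "identify", "1", "device",
            str(channel), str(device_number), action]


log_line = "[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072"
device = failing_device(log_line)
print(" ".join(hdparm_command(device)))
```

Building the argument list separately from running it makes it easy to review the command before it is entered at the command prompt.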

User guides for GPUs and PCIe adapters

Use this information to find the user guide for your graphics processing unit (GPU) or PCIe adapter.
Use the following table to find the user guide for the GPU or PCIe adapter that you are using.
Table 16. GPU and PCIe adapter user guides

Name          User guide
Broadcom      Broadcom website (http://www.broadcom.com)
Emulex        Emulex website (http://www.emulex.com/products/ethernet-networking-storage-connectivity/ethernet-networking-adapters/ibm-branded/selection-guide/)
Marvell       Marvell website (http://www.marvell.com/storage/system-solutions/sata-controllers/)
Mellanox      Mellanox Technologies website (http://mymellanox.force.com/support/VF_SerialSearch)
NVIDIA        NVIDIA website (http://www.nvidia.com)
PMC-Sierra    PMC-Sierra website (http://www.nvidia.com)
QLogic        QLogic website (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/IBM_Search.aspx)

Resolving an over temperature problem for a water-cooled 8335-GTB system

Learn how to identify the service action that is needed to resolve an over temperature problem.
1. Go to Water cooling system specification and requirements. Are all of the requirements for water-cooled systems met?
   Note: For information specific to the 8335-GTB, see Model 8335-GTB water cooling option (Feature code E2RD).
   Yes: Continue with the next step.
   No: Work with the customer to ensure that all of the requirements for water-cooled systems are met. This ends the procedure.
2. Is the room temperature less than 40°C (104°F)?
   Yes: Continue with the next step.
   No: Notify the customer. The customer must bring the room temperature within normal range. Continue with the next step.
3. Ensure that the following requirements are met:
   a. The quick-connects between the 8335-GTB system and the water manifold are mated and connected to the proper circuits of the manifold. The supply hose must be connected to the supply manifold circuit, which is the manifold circuit that is located toward the inside of the rack. The return hose must be connected to the return manifold circuit, which is the manifold circuit that is located toward the outside of the rack.
   b. The facility water supply hose is properly connected to the supply hose on the manifold and the return hose on the manifold is properly connected to the facility water return hose.
      • The ball valves that connect the facility water supply hose to the manifold supply hose and the facility water return hose to the manifold return hose are open. For more information about connecting the facility water hoses to the manifold hoses, see Replacing the water manifold in the 8335-GTB.
      • All of the valves that might restrict the flow of water through the hoses are open in the facility water system.
      • The pumping unit of the facility water system is on and does not have errors.
   c. The facility water system is supplying water at the required temperature and flow. For instructions, see Model 8335-GTB water cooling option (Feature code E2RD).
   Does the problem persist?
   Yes: Continue with the next step.
        Note: Steps 1 - 3 resolve most problems. Ensure that you carefully check steps 1 - 3 before you continue with the next step.
   No: This ends the procedure.
4. Is a processor overheating, but the other processor and the graphics processing units (GPUs) are not overheating?
   Yes: Check the thermal interface material (TIM) between the cold plate and the processor that is overheating. Go to Removing a system processor module from a water-cooled 8335-GTB system and complete the steps to lift the cold plate off the processor. If the TIM pad is damaged, replace the TIM pad. To replace a TIM pad, go to Replacing a system processor module in a water-cooled 8335-GTB system and complete the steps for removing and installing a new TIM pad. This ends the procedure.
   No: Continue with the next step.
5. Is a GPU overheating, but the other GPUs and the processors are not overheating?
   Yes: Replace the thermal interface material (TIM) between the cold plate and the GPU that is overheating. Go to Removing the graphics processing unit from a water-cooled 8335-GTB system and complete the steps to lift the cold plate off the GPU. Then, go to Replacing the graphics processing unit in a water-cooled 8335-GTB system and complete the steps for installing a new TIM pad. If the problem is not resolved, replace the GPU. For instructions about replacing a GPU, see Removing and replacing a graphics processing unit in the 8335-GTB. This ends the procedure.
   No: Continue with the next step.
6. Replace the cold plates. For instructions about how to replace the cold plates, see Removing and replacing the cold plates in the 8335-GTB. Does the problem persist?
   Yes: Go to “Contacting IBM service and support” on page 110. This ends the procedure.
   No: This ends the procedure.
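
The branch in steps 4 to 6 depends only on how many processors and GPUs are overheating. The following Python sketch restates that branch as a function. It is an illustration of the decision logic, assuming that steps 1 to 3 were already checked and that the problem persists; the function name next_step and its return strings are hypothetical.

```python
def next_step(hot_processors, hot_gpus):
    """Pick the step to perform from the number of overheating components.

    hot_processors: number of processors that are overheating
    hot_gpus: number of GPUs that are overheating
    """
    if hot_processors == 1 and hot_gpus == 0:
        # Step 4: a single processor is overheating and nothing else is.
        return "Step 4: check the TIM pad on the overheating processor"
    if hot_gpus == 1 and hot_processors == 0:
        # Step 5: a single GPU is overheating and nothing else is.
        return "Step 5: replace the TIM pad on the overheating GPU"
    # Step 6: any other combination.
    return "Step 6: replace the cold plates"


print(next_step(1, 0))
print(next_step(0, 1))
print(next_step(2, 1))
```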

Identifying a service action

Use the following procedures to help you identify the service action that is needed.

Identifying a service action by using system event logs

Use the Intelligent Platform Management Interface (IPMI) program to examine system event logs (SELs) to identify a service action.
1. Use the ipmitool command to examine SELs.
   • To list SELs by using an in-band network, use the following command:
     ipmitool sel elist
   • To list SELs remotely over the LAN, use the following command:
     ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist
2. Scan the SELs for an event with the value OEM record de. Did you find a SEL with the value OEM record de?
   Yes: Continue with the next step.
   No: Go to step 4 on page 29.
3. The OEM record de specific log information is indicated by the rightmost digits of the SEL with the value OEM record de. Use Table 17 to determine the service action to perform.
Table 17. OEM record de specific log information and service action
OEM record de specific log information Service action
00xxxxxxxxxx Go to Getting fixes and update the system firmware to
the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
01xxxxxxxxxx Go to the “EPUB_PRC_FIND_DECONFIGURE_PART
isolation procedure” on page 96.
04xxxxxxxxxx Go to the “EPUB_PRC_SP_CODE isolation procedure”
on page 97.
05xxxxxxxxxx Go to the “EPUB_PRC_PHYP_CODE isolation
procedure” on page 97.
08xxxxxxxxxx Go to the “EPUB_PRC_ALL_PROCS isolation procedure”
on page 98.
09xxxxxxxxxx Go to the “EPUB_PRC_ALL_MEMCRDS isolation
procedure” on page 98.
0Axxxxxxxxxx Go to Getting fixes and update the system firmware to
the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
10xxxxxxxxxx Go to the “EPUB_PRC_LVL_SUPPORT isolation
procedure” on page 99.
16xxxxxxxxxx Go to Getting fixes and update the system firmware to
the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
1Cxxxxxxxxxx Go to Getting fixes and update the system firmware to
the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
22xxxxxxxxxx Go to the “EPUB_PRC_MEMORY_PLUGGING_ERROR
isolation procedure” on page 100.
2Dxxxxxxxxxx Go to the “EPUB_PRC_FSI_PATH isolation procedure”
on page 100.
30xxxxxxxxxx Go to the “EPUB_PRC_PROC_AB_BUS isolation
procedure” on page 101.
31xxxxxxxxxx Go to the “EPUB_PRC_PROC_XYZ_BUS isolation
procedure” on page 101.
34xxxxxxxxxx Go to Getting fixes and update the system firmware to
the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
37xxxxxxxxxx Go to the “EPUB_PRC_EIBUS_ERROR isolation
procedure” on page 102.
3Fxxxxxxxxxx Go to the “EPUB_PRC_POWER_ERROR isolation
procedure” on page 103.
4Dxxxxxxxxxx Go to Getting fixes and update the system firmware to
the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
4Fxxxxxxxxxx Go to the “EPUB_PRC_MEMORY_UE isolation
procedure” on page 104.
55xxxxxxxxxx Go to the “EPUB_PRC_HB_CODE isolation procedure”
on page 104.
56xxxxxxxxxx Go to the “EPUB_PRC_TOD_CLOCK_ERR isolation
procedure” on page 106.
5Cxxxxxxxxxx Go to the “EPUB_PRC_COOLING_SYSTEM_ERR
isolation procedure” on page 106.
5Exxxxxxxxxx Go to the “EPUB_PRC_GPU_ISOLATION_PROCEDURE
isolation procedure” on page 107.
This ends the procedure.
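
The lookup in Table 17 keys on the first two hexadecimal digits of the OEM record de specific log information. The following Python sketch shows that lookup. The prefixes and procedure names are copied from Table 17; the function name service_action, the constant names, the example value 040000000000, and the choice to return None for a prefix that the table does not list are assumptions made for illustration.

```python
# Prefixes whose service action in Table 17 is a system firmware update.
FIRMWARE_PREFIXES = {"00", "0A", "16", "1C", "34", "4D"}

# Prefixes that map to an isolation procedure in Table 17.
ISOLATION_PROCEDURES = {
    "01": "EPUB_PRC_FIND_DECONFIGURE_PART",
    "04": "EPUB_PRC_SP_CODE",
    "05": "EPUB_PRC_PHYP_CODE",
    "08": "EPUB_PRC_ALL_PROCS",
    "09": "EPUB_PRC_ALL_MEMCRDS",
    "10": "EPUB_PRC_LVL_SUPPORT",
    "22": "EPUB_PRC_MEMORY_PLUGGING_ERROR",
    "2D": "EPUB_PRC_FSI_PATH",
    "30": "EPUB_PRC_PROC_AB_BUS",
    "31": "EPUB_PRC_PROC_XYZ_BUS",
    "37": "EPUB_PRC_EIBUS_ERROR",
    "3F": "EPUB_PRC_POWER_ERROR",
    "4F": "EPUB_PRC_MEMORY_UE",
    "55": "EPUB_PRC_HB_CODE",
    "56": "EPUB_PRC_TOD_CLOCK_ERR",
    "5C": "EPUB_PRC_COOLING_SYSTEM_ERR",
    "5E": "EPUB_PRC_GPU_ISOLATION_PROCEDURE",
}


def service_action(log_info):
    """Map OEM record de specific log information to a Table 17 action."""
    prefix = log_info.strip()[:2].upper()
    if prefix in FIRMWARE_PREFIXES:
        return "Go to Getting fixes and update the system firmware"
    name = ISOLATION_PROCEDURES.get(prefix)
    if name is not None:
        return f"Go to the {name} isolation procedure"
    return None  # prefix is not listed in Table 17


# Hypothetical value; only the first two digits matter for the lookup.
print(service_action("040000000000"))
```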
4. Scan the SELs for an event with the value OEM record df. Did you find a SEL with the value OEM
record df?
If Then Yes: Continue with the next step. No Go to step 10 on page 31.
5. One or more events might be logged around the same time as the event with the value OEM record
df. These events require a service action if they meet the following criteria:
v A service action keyword is present. For a list of service action keywords, see “Identifying service
action keywords in system event logs” on page 36.
v Asserted is in the description. v OEM record is not in the description. v The event has a time stamp in close proximity to the time stamp of the event with the value OEM
record df.
6. Did you find any SEL events that require a service action as defined in step 5?
If Then Yes: Continue with the next step.
Beginning troubleshooting and problem analysis 29
If Then No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and
support” on page 110.
7. Did you find only one SEL event that requires a service action as defined in step 5 on page 29?
If Then Yes: Continue with the next step. No: Go to step 9.
8. Record the SEL record ID for the event you identified in step 5 on page 29. The SEL record ID is
indicated by the leftmost digits of the SEL. Use the ipmitool command to display the SEL details. v To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
v To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the sensor name, sensor ID, and event description. Then, use the following information to determine the service action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor
and event information for the 8335-GCA and 8335-GTA” on page 37.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78.
This ends the procedure.
9. You identified more than one event in step 5 on page 29. The service actions for all of the events that
were identified in step 5 on page 29 must be performed to successfully complete the repair. Record the SEL record IDs for the events that you identified in step 5 on page 29. The SEL record ID is indicated by the leftmost digits of the SEL. Use the ipmitool command to display SEL details for each SEL record ID that you recorded.
v To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
v To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the sensor name, sensor ID, and event description. Then, use this information to determine the service action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor
and event information for the 8335-GCA and 8335-GTA” on page 37.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78.
This ends the procedure.
10. Scan the SEL for an event with the value OEM record c0.
11. Did you find an event with the value OEM record c0?
Yes: Continue with the next step.
No: Go to step 13 on page 35.
12. The OEM record c0 specific log information is indicated by the rightmost digits of the SEL with the
value OEM record c0. If your system is an 8335-GCA or 8335-GTA, use Table 18 to determine the service action to perform. If your system is an 8335-GTB, use Table 19 on page 32 to determine the service action to perform. If your system is an 8348-21C, use Table 20 on page 34 to determine the service action to perform.
Table 18. OEM record c0 specific log information, description, and service action for an 8335-GCA or 8335-GTA
OEM record c0 specific log information | Description | Service action
320a01xxxxxx | Phy read failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320a02xxxxxx | Phy speed and duplex failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320exxxxxxxx | OCC reset required | This event is for information only. No service action is required.
3a0400xxxxxx | Chassis soft power off | A user initiated power off request occurred. No service action is required.
3a0402xxxxxx | Chassis soft reboot | A user initiated power off request occurred. No service action is required.
3a0701xxxxxx | Request for PNOR access | This event is for information only. No service action is required.
3a0702xxxxxx | Release of PNOR access | This event is for information only. No service action is required.
3a1100xxxxxx | Fan thread stopped | This event is for information only. No service action is required.
3a1101xxxxxx | Fan thread started | This event is for information only. No service action is required.
3a1503xxxxxx | Primary side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1504xxxxxx | Golden side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx | Fan 1 failure | Replace Fan 1. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a1602xxxxxx | Fan 2 failure | Replace Fan 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a1603xxxxxx | Fan 3 failure | Replace Fan 3. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a1604xxxxxx | Fan 4 failure | Replace Fan 4. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a260xyyyyyy, where x = 1, 2, or 3 | System shut down due to one or more missing or failed fans | The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a2604yyyyyy | All of the fans are missing or failed | Ensure that the fan power cable and the disk and fan signal cable are seated properly. If the problem persists, replace the following items, one at a time, until the problem is resolved: power riser with time-of-day battery slot; fan power cable; disk and fan signal cable; disk drive and fan card. Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Table 19. OEM record c0 specific log information, description, and service action for an 8335-GTB
OEM record c0 specific log information | Description | Service action
320a01xxxxxx | Phy read failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320a02xxxxxx | Phy speed and duplex failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320exxxxxxxx | OCC reset required | This event is for information only. No service action is required.
3a0400xxxxxx | Chassis soft power off | A user initiated power off request occurred. No service action is required.
3a0402xxxxxx | Chassis soft reboot | A user initiated power off request occurred. No service action is required.
3a0701xxxxxx | Request for PNOR access | This event is for information only. No service action is required.
3a0702xxxxxx | Release of PNOR access | This event is for information only. No service action is required.
3a1100xxxxxx | Fan thread stopped | This event is for information only. No service action is required.
3a1101xxxxxx | Fan thread started | This event is for information only. No service action is required.
3a1503xxxxxx | Primary side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1504xxxxxx | Golden side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx | Fan 1 failure | Replace Fan 1. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a1602xxxxxx | Fan 2 failure | Replace Fan 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a1603xxxxxx | Fan 3 failure | Replace Fan 3. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a1604xxxxxx | Fan 4 failure | Replace Fan 4. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a2600xxxxxx | The water-cooled system shut down due to too many processor core sensors reading a temperature at or above the maximum temperature that is allowed. | At least one processor is overheating. Go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26.
3a260xyyyyyy, where x = 1, 2, or 3 | System shut down due to one or more missing or failed fans | The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a2604yyyyyy | All of the fans are missing or failed | Ensure that the fan power cable and the disk and fan signal cable are seated properly. If the problem persists, replace the following items, one at a time, until the problem is resolved: power riser with time-of-day battery slot; fan power cable; disk and fan signal cable; disk drive and fan card. Note: Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Table 20. OEM record c0 specific log information, description, and service action for an 8348-21C
OEM record c0 specific log information | Description | Service action
320a01xxxxxx | Phy read failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320a02xxxxxx | Phy speed and duplex failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320exxxxxxxx | OCC reset required | This event is for information only. No service action is required.
3a0400xxxxxx | Chassis soft power off | A user initiated power off request occurred. No service action is required.
3a0402xxxxxx | Chassis soft reboot | A user initiated power off request occurred. No service action is required.
3a0701xxxxxx | Request for PNOR access | This event is for information only. No service action is required.
3a0702xxxxxx | Release of PNOR access | This event is for information only. No service action is required.
3a1100xxxxxx | Fan thread stopped | This event is for information only. No service action is required.
3a1101xxxxxx | Fan thread started | This event is for information only. No service action is required.
3a1503xxxxxx | Primary side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1504xxxxxx | Golden side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx | Fan 1 failure | Replace Fan 1. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1602xxxxxx | Fan 2 failure | Replace Fan 2. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1603xxxxxx | Fan 3 failure | Replace Fan 3. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1604xxxxxx | Fan 4 failure | Replace Fan 4. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1605xxxxxx | Fan 5 failure | Replace Fan 5. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a260xyyyyyy, where x = 1, 2, 3, or 4 | System shut down due to one or more missing or failed fans | The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing or failed fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a2605yyyyyy | All of the fans are missing or failed | Replace the disk drive backplane. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
13. One or more SEL events might require a service action. These events require a service action if they meet the following criteria:
v A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
v Asserted is in the description.
v OEM record is not in the description.
14. Did you find one or more SEL events that require a service action as defined in step 13?
Yes: Continue with the next step.
No: This ends the procedure.
15. The service actions for all of the events that were identified in step 13 must be performed to successfully complete the repair. Record the SEL record IDs for the events that you identified in step 13. The SEL record ID is indicated by the leftmost digits of the SEL. Use the ipmitool command to display SEL details for each SEL record ID that you recorded.
v To display SEL details by using an in-band network, use the following command:
ipmitool sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
v To display SEL details remotely over the LAN, use the following command:
ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>
Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.
The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the sensor name, sensor ID, and event description. Then, use this information to determine the service action to perform:
v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor
and event information for the 8335-GCA and 8335-GTA” on page 37.
v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event
information for the 8335-GTB” on page 57.
v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event
information for the 8348-21C” on page 78.
This ends the procedure.
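The ipmitool invocations and the OEM record c0 lookup in this procedure can be scripted. The following Python sketch is illustrative only and is not part of the IBM procedure. The function names sel_get_command and classify_oem_c0 are hypothetical. The lookup covers only the boot failure, fan failure, and fan shutdown prefixes that Tables 18, 19, and 20 have in common, and it assumes that the specific log information is available as a 12-digit hexadecimal string. Always confirm the service action in the table for your model.

```python
import re


def sel_get_command(record_id, host=None, username=None, password=None):
    """Build the argument list for `ipmitool sel get`; nothing is executed.

    record_id is an integer. It is rendered in hexadecimal (for example,
    0x1a) because the procedure requires the SEL record ID in that format.
    """
    cmd = ["ipmitool"]
    if host is not None:
        # Remote access over the LAN, using the options shown in the procedure.
        cmd += ["-I", "lanplus", "-U", username, "-P", password, "-H", host]
    cmd += ["sel", "get", hex(record_id)]
    return cmd


def classify_oem_c0(log_info):
    """Give a rough classification of OEM record c0 specific log information.

    Only prefixes shared by Tables 18, 19, and 20 are handled here
    (an assumption of this sketch); anything else returns "see table".
    """
    code = log_info.lower()
    if re.fullmatch(r"[0-9a-f]{12}", code) is None:
        raise ValueError("expected 12 hexadecimal digits")
    if code.startswith(("3a1503", "3a1504")):
        return "boot failure"
    if code.startswith("3a16"):
        # 3a160N...: failure of fan N.
        return "fan %d failure" % int(code[4:6], 16)
    if code.startswith("3a26") and code[4:6] != "00":
        # 3a260x...: x fans were missing or failed at shutdown.
        return "shutdown with %d fans missing or failed" % int(code[4:6], 16)
    return "see table"
```

On a system where ipmitool is installed, the returned list could be passed to subprocess.run. Passing the password as a command-line argument mirrors the commands in the procedure.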

Identifying service action keywords in system event logs

System event logs (SELs) that have Asserted and any of the keywords indicated below in the description require a service action.
Temperature, voltage, and current service action keywords
v Transition to Critical from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
Fan service action keywords
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Device Removed / Device Absent
v Transition to degraded
v Install error
v Redundancy lost
v Non-redundant insufficient resources
Memory service action keywords
v Configuration Error
v Transition to Non-recoverable
v Predictive Failure
Processor service action keywords
v IERR
v Transition to Non-recoverable
v Predictive Failure
Power supply and All PGood service action keywords
v Power Supply Failure Detected
v Predictive Failure
v Power Supply Input Lost or AC DC
v Power Supply Input Lost Or Out of Range
v Power Supply Input Out of Range But Present
v Configuration Error
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
v Redundancy lost
v Non-redundant insufficient resources
v AC Lost
v Soft Power Control Failure
v Power Unit Failure Detected
v Predictive Failure
System firmware service action keywords
v System Firmware Error
v System Firmware Hang
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
System ACPI power state service action keywords
v Unknown
Watchdog service action keywords
v Hard Reset
v Power Down
v Power Cycle
v Timer Interrupt
System event service action keywords
v Undetermined system hardware failure
OS boot service action keywords
v Installation aborted
v Installation failed
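The three criteria for a service action (a keyword is present, Asserted is in the description, and OEM record is not in the description) can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not an IBM tool. The function name requires_service_action is hypothetical, the keyword tuple is an abbreviated copy of the lists in this section, and the sketch ignores the fact that each keyword applies only to its own sensor type. The test for Asserted is case-sensitive so that Deasserted does not match.

```python
import re

# Abbreviated copy of the keyword lists in this section; extend as needed.
SERVICE_ACTION_KEYWORDS = (
    "Transition to Critical from Less Severe",
    "Transition to Non-recoverable",
    "Predictive Failure",
    "Configuration Error",
    "Power Supply Failure Detected",
    "System Firmware Hang",
)


def requires_service_action(description):
    """Return True if one SEL line meets all three criteria."""
    if "OEM record" in description:
        return False
    # Case-sensitive whole-word match: "Deasserted" must not count.
    if re.search(r"\bAsserted\b", description) is None:
        return False
    lowered = description.lower()
    return any(keyword.lower() in lowered for keyword in SERVICE_ACTION_KEYWORDS)
```

For example, a line whose description ends in "Transition to Non-recoverable | Asserted" is flagged, and the same line ending in "Deasserted" is not.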

Identifying a service action by using sensor and event information

You can use sensor and event information from the system event log (SEL) to determine a service action.
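The sensor ID field that you record from the SEL details has the format sensor name (sensor ID), and the tables in this section are keyed by that pair. As a minimal sketch, the following Python function splits the field into its two parts. The name parse_sensor_field and the regular expression are assumptions made for illustration, and the sketch assumes that the ID is written in hexadecimal with a 0x or 0X prefix, as it is in the tables.

```python
import re

# Matches "<sensor name> (<hexadecimal sensor ID>)", for example "DIMM Func 1 (0x1E)".
SENSOR_FIELD = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<id>0[xX][0-9A-Fa-f]+)\)\s*$")


def parse_sensor_field(field):
    """Split a sensor ID field into (sensor name, numeric sensor ID)."""
    match = SENSOR_FIELD.match(field)
    if match is None:
        raise ValueError("unexpected sensor field: %r" % field)
    # int() with base 16 accepts the 0x or 0X prefix.
    return match.group("name"), int(match.group("id"), 16)
```

The numeric ID makes it easier to compare a recorded sensor against a range of IDs in a table row, regardless of letter case in the hexadecimal digits.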
Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA
You can use the sensor and event information from the system event log (SEL) to determine a service action to perform for the IBM Power® System S822LC (8335-GCA and 8335-GTA).
If you have not done so already, complete “Identifying a service action by using system event logs” on page 27. Then, use the following table to determine the service action to perform.
Table 21. Sensor information, event description, and service action for the 8335-GCA and 8335-GTA
Sensor name (Sensor ID): Watchdog (0x00)
Event description: Timer Expired; Reserved1; Reserved2; Reserved3; Reserved4
Service action: No service action is required.
Sensor name (Sensor ID): Watchdog (0x00)
Event description: Hard Reset; Power Down; Power Cycle; Timer Interrupt
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.
Sensor name (Sensor ID): Host Status (0x04)
Event description: Unknown
Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Sensor name (Sensor ID): Host Status (0x04)
Event description: S0/G0 “Working”; S1 “Sleeping with system h/w & processor context maintained”; S2 “sleeping, processor context lost”; S3 “sleeping, processor & h/w context lost, memory retained”; S4 “non-volatile sleep / suspend-to disk”; S5 / G2: “soft-off”; S4 / S5: “soft-off”; G3 mechanical Off; Sleeping in an S1/S2/S3 State; G1: Sleeping; S5: entered by override; Legacy ON state; Legacy OFF state
Service action: No service action is required.
Sensor name (Sensor ID): FW Boot Progress (0x05)
Event description: System Firmware Error; System Firmware Hang
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.
Sensor name (Sensor ID): FW Boot Progress (0x05)
Event description: System Firmware Progress
Service action: No service action is required.
Sensor name (Sensor ID): OCC 1 Active (0x08); OCC 2 Active (0x09)
Event description: Device Disabled
Service action: If the sensor name is OCC 1 Active, replace CPU 1. If the sensor name is OCC 2 Active, replace CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): OCC 1 Active (0x08); OCC 2 Active (0x09)
Event description: State Deasserted; Device Enabled
Service action: No service action is required.
Sensor name (Sensor ID): Ambient Temp (0x0A)
Event description: Upper Critical - going low; Lower Non-critical – going low; Lower Non-critical – going high; Lower Critical – going high; Lower Non-recoverable – going low; Lower Non-recoverable – going high; Upper Non-critical – going low; Upper Non-critical – going high; Lower Critical - going low; Upper Non-recoverable – going low; Upper Non-recoverable – going high
Service action: No service action is required.
Sensor name (Sensor ID): Ambient Temp (0x0A)
Event description: Upper Critical - going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.
Sensor name (Sensor ID): CPU1 Temp (0x0B); CPU2 Temp (0x0D)
Event description: Lower Non-critical – going low; Lower Non-critical – going high; Lower Critical - going low; Lower Critical – going high; Lower Non-recoverable – going low; Lower Non-recoverable – going high; Upper Non-critical – going low; Upper Non-critical – going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable – going low; Upper Non-recoverable – going high
Service action: No service action is required.
Sensor name (Sensor ID): CPU Func 1 (0x0C); CPU Func 2 (0x0E)
Event description: IERR; Transition to Non-recoverable; Predictive Failure
Service action: If the sensor name is CPU Func 1, replace CPU 1. If the sensor name is CPU Func 2, replace CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): CPU Func 1 (0x0C); CPU Func 2 (0x0E)
Event description: Thermal Trip; FRB1 BIST Failure; FRB2 Hang In POST Failure; FRB3 Processor Startup Initialization Failure; Configuration Error; SMBIOS Uncorrectable CPU Complex Error; Processor Disabled; Terminator Presence Detected; Processor Automatically Throttled; Machine Check Exception; Correctable Machine Check Error; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Processor Presence Detected; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
Service action: No service action is required.
Sensor name (Sensor ID): All PGood (0x1C)
Event description: Interlock Power Down; Power Off Power Down; Power Cycle; 240VA Power Down
Service action: No service action is required.
Sensor name (Sensor ID): All PGood (0x1C)
Event description: AC Lost; Soft Power Control Failure
Service action:
v Ensure that ac power is supplied to the rack.
v Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies.
v Ensure that the system was not powered off.
Sensor name (Sensor ID): All PGood (0x1C)
Event description: Power Unit Failure Detected; Predictive Failure
Service action:
v Ensure that ac power is supplied to the rack.
v Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU unit.
v Ensure that the system was not powered off.
v Check for service action required SEL events for the power supply sensor. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37.
Sensor name (Sensor ID): DIMM Func 1 (0x1E); DIMM Func 2 (0x1F); DIMM Func 3 (0x20); DIMM Func 4 (0x21); DIMM Func 5 (0x22); DIMM Func 6 (0x23); DIMM Func 7 (0x24); DIMM Func 8 (0x25); DIMM Func 9 (0x26); DIMM Func 10 (0x27); DIMM Func 11 (0x28); DIMM Func 12 (0x29); DIMM Func 13 (0x2A); DIMM Func 14 (0x2B); DIMM Func 15 (0x2C); DIMM Func 16 (0x2D); DIMM Func 17 (0x2E); DIMM Func 18 (0x2F); DIMM Func 19 (0x30); DIMM Func 20 (0x31); DIMM Func 21 (0x32); DIMM Func 22 (0x33); DIMM Func 23 (0x34); DIMM Func 24 (0x35); DIMM Func 25 (0x36); DIMM Func 26 (0x37); DIMM Func 27 (0x38); DIMM Func 28 (0x39); DIMM Func 29 (0x3A); DIMM Func 30 (0x3B); DIMM Func 31 (0x3C); DIMM Func 32 (0x3D)
Event description: Memory Device Disabled; Uncorrectable Memory Error; Memory Scrub Failed; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Correctable Memory Error; Parity; Correctable Memory Error Logging Limit Reached; Memory Automatically Throttled; Critical Over temperature; Presence Detected; Spare; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
Service action: No service action is required.
Sensor name (Sensor ID): DIMM Func 1 (0x1E) through DIMM Func 32 (0x3D), as listed in the previous row
Event description: Transition to Non-recoverable; Predictive Failure
Service action: If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): DIMM Func 1 (0x1E); DIMM Func 2 (0x1F); DIMM Func 3 (0x20); DIMM Func 4 (0x21); DIMM Func 5 (0x22); DIMM Func 6 (0x23); DIMM Func 7 (0x24); DIMM Func 8 (0x25); DIMM Func 9 (0x26); DIMM Func 10 (0x27); DIMM Func 11 (0x28); DIMM Func 12 (0x29); DIMM Func 13 (0x2A); DIMM Func 14 (0x2B); DIMM Func 15 (0x2C); DIMM Func 16 (0x2D); DIMM Func 17 (0x2E); DIMM Func 18 (0x2F); DIMM Func 19 (0x30); DIMM Func 20 (0x31); DIMM Func 21 (0x32); DIMM Func 22 (0x33); DIMM Func 23 (0x34); DIMM Func 24 (0x35); DIMM Func 25 (0x36); DIMM Func 26 (0x37); DIMM Func 27 (0x38); DIMM Func 28 (0x39); DIMM Func 29 (0x3A); DIMM Func 30 (0x3B); DIMM Func 31 (0x3C); DIMM Func 32 (0x3D)
Event description: Configuration Error
Service action: Complete the following steps:
1. If the sensor name is DIMM Func 1, ensure that DIMM 1 is seated properly. If the sensor name is DIMM Func 2, ensure that DIMM 2 is seated properly. And so on.
2. If you recently installed or replaced memory DIMMs, ensure that the DIMMs are plugged in the correct memory slots.
3. If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): CPU Core Func 1 (0x3E); CPU Core Func 2 (0x3F); CPU Core Func 3 (0x40); CPU Core Func 4 (0x41); CPU Core Func 5 (0x42); CPU Core Func 6 (0x43); CPU Core Func 7 (0x44); CPU Core Func 8 (0x45); CPU Core Func 9 (0x46); CPU Core Func 10 (0x47); CPU Core Func 11 (0x48); CPU Core Func 12 (0x49)
Event description: IERR; Transition to Non-recoverable; Predictive Failure
Service action: Replace system processor CPU 1. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): CPU Core Func 1 (0x3E) through CPU Core Func 12 (0x49), as listed in the previous row
Event description: FRB1 BIST Failure; FRB2 Hang In POST Failure; FRB3 Processor Startup Initialization Failure; Configuration Error; SMBIOS Uncorrectable CPU Complex Error; Processor Disabled; Terminator Presence Detected; Machine Check Exception; Correctable Machine Check Error; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Thermal Trip; Processor Automatically Throttled; Processor Presence Detected; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
Service action: No service action is required.
Sensor name (Sensor ID): CPU Core Func 13 (0x4A); CPU Core Func 14 (0x4B); CPU Core Func 15 (0x4C); CPU Core Func 16 (0x4D); CPU Core Func 17 (0x4E); CPU Core Func 18 (0x4F); CPU Core Func 19 (0x50); CPU Core Func 20 (0x51); CPU Core Func 21 (0x52); CPU Core Func 22 (0x53); CPU Core Func 23 (0x54); CPU Core Func 24 (0x55)
Event description: IERR; Transition to Non-recoverable; Predictive Failure
Service action: Replace system processor CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): CPU Core Func 13 (0x4A) through CPU Core Func 24 (0x55), as listed in the previous row
Event description: FRB1 BIST Failure; FRB2 Hang In POST Failure; FRB3 Processor Startup Initialization Failure; Configuration Error; SMBIOS Uncorrectable CPU Complex Error; Processor Disabled; Terminator Presence Detected; Machine Check Exception; Correctable Machine Check Error; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Thermal Trip; Processor Automatically Throttled; Processor Presence Detected; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
Service action: No service action is required.
Sensor name (Sensor ID): Mem Buf Func 1 (0x56); Mem Buf Func 2 (0x57); Mem Buf Func 3 (0x58); Mem Buf Func 4 (0x59); Mem Buf Func 5 (0x5A); Mem Buf Func 6 (0x5B); Mem Buf Func 7 (0x5C); Mem Buf Func 8 (0x5D)
Event description: Uncorrectable Memory Error; Memory Device Disabled; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Correctable Memory Error; Parity; Memory Scrub Failed; Correctable Memory Error Logging Limit Reached; Memory Automatically Throttled; Critical Over temperature; Presence Detected; Spare; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
Service action: No service action is required.
Sensor name (Sensor ID): Mem Buf Func 1 (0x56) through Mem Buf Func 8 (0x5D), as listed in the previous row
Event description: Configuration Error; Transition to Non-recoverable; Predictive Failure
Service action: If the sensor name is Mem Buf Func 1, replace memory riser 1. If the sensor name is Mem Buf Func 2, replace memory riser 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): Boot Count (0x5F)
Event description: None
Service action: No service action is required.
Sensor name (Sensor ID): Motherboard Flt (0x60)
Event description: State Deasserted
Service action: No service action is required.
Sensor name (Sensor ID): Motherboard Flt (0x60)
Event description: State Asserted
Service action: Replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): System Event (0x61)
Event description: Undetermined system hardware failure
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Sensor name (Sensor ID): System Event (0x61)
Event description: System Reconfigured; OEM System boot event; Entry added to auxiliary log; PEF Action; Timestamp Clock Sync; Transition State Active; Transition State Idle; Transition State Busy
Service action: No service action is required.
Sensor name (Sensor ID): Activate Pwr Lt (0x62)
Event description: None
Service action: No service action is required.
Sensor name (Sensor ID): Ref Clock Fault (0x63); PCI Clock Fault (0x64)
Event description: State Deasserted; State Asserted
Service action: No service action is required.
Sensor name (Sensor ID): DIMM1 Temp (0x69); DIMM2 Temp (0x6A); DIMM3 Temp (0x6B); DIMM4 Temp (0x6C); DIMM5 Temp (0x6D); DIMM6 Temp (0x6E); DIMM7 Temp (0x6F); DIMM8 Temp (0x70); DIMM9 Temp (0x71); DIMM10 Temp (0x72); DIMM11 Temp (0x73); DIMM12 Temp (0x74); DIMM13 Temp (0x75); DIMM14 Temp (0x76); DIMM15 Temp (0x77); DIMM16 Temp (0x78); DIMM17 Temp (0x79); DIMM18 Temp (0x7A); DIMM19 Temp (0x7B); DIMM20 Temp (0x7C); DIMM21 Temp (0x7D); DIMM22 Temp (0x7E); DIMM23 Temp (0x7F); DIMM24 Temp (0x80); DIMM25 Temp (0x81); DIMM26 Temp (0x82); DIMM27 Temp (0x83); DIMM28 Temp (0x84); DIMM29 Temp (0x85); DIMM30 Temp (0x86); DIMM31 Temp (0x87); DIMM32 Temp (0x88)
Event description: Lower Non-critical – going low; Lower Non-critical – going high; Lower Critical – going low; Lower Critical – going high; Lower Non-recoverable – going low; Lower Non-recoverable – going high; Upper Non-critical – going low; Upper Non-critical – going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable – going low; Upper Non-recoverable – going high
Service action: No service action is required.
v CPU Core Temp 1 (0x89) v CPU Core Temp 2 (0x8A) v CPU Core Temp 3 (0x8B) v CPU Core Temp 4 (0x8C) v CPU Core Temp 5 (0x8D) v CPU Core Temp 6 (0x8E) v CPU Core Temp 7 (0x8F) v CPU Core Temp 8 (0x90) v CPU Core Temp 9 (0x91) v CPU Core Temp 10 (0x92) v CPU Core Temp 11 (0x93) v CPU Core Temp 12 (0x94)
v CPU Core Temp 13 (0x95) v CPU Core Temp 14 (0x96) v CPU Core Temp 15 (0x97) v CPU Core Temp 16 (0x98) v CPU Core Temp 17 (0x99) v CPU Core Temp 18 (0x9A) v CPU Core Temp 19 (0x9B) v CPU Core Temp 20 (0x9C) v CPU Core Temp 21 (0x9D) v CPU Core Temp 22 (0x9E) v CPU Core Temp 23 (0x9F) v CPU Core Temp 24 (0xA0)
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
No service action is required.
v 12V Sense (0xA1) v Proc0 Power (0xA2) v Proc1 Power (0xA3) v PCIE Proc0 Pwr (0xA6) v PCIE Proc1 Pwr (0xA7) v GPU Sense (0xAA) v Mem Cache Power (0xAB) v Mem Proc0 Pwr (0xAC) v Mem Proc1 Pwr (0xAD) v Fan Power A (0xB0)
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low
No service action required.
v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
Sensor name (Sensor ID):
• TOD Clock Fault (0xB1)
• APSS Fault (0xB2)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): PS Derating Factor (0xB4)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID): OS Boot (0xB5)
Event description:
• Installation aborted
• Installation failed
Service action: Ensure that the operating system boot image is loaded. Ensure that the disk drive or solid-state drive is ready. Reload the operating system boot image.
Event description:
• A: boot completed
• C: boot completed
• PXE boot completed
• Diagnostic boot completed
• CD-ROM boot completed
• ROM boot completed
• Boot completed - device not specified
• Installation started
• Installation completed
Service action: No service action is required.

Sensor name (Sensor ID): PCI (0xB6)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.
v GPU Func 1 (0xB8) v GPU Func 2 (0xB9) v GPU Func 3 (0xBA) v GPU Func 4 (0xBB)
v GPU Temp 1 (0xBC) v GPU Temp 2 (0xBD) v GPU Temp 3 (0xBE) v GPU Temp 4 (0xBF)
v Mem Buf Temp 1 (0xC0) v Mem Buf Temp 2 (0xC1) v Mem Buf Temp 3 (0xC2) v Mem Buf Temp 4 (0xC3) v Mem Buf Temp 5 (0xC4) v Mem Buf Temp 6 (0xC5) v Mem Buf Temp 7 (0xC6) v Mem Buf Temp 8 (0xC7)
v Uncorrectable Memory Error v Parity v Memory Scrub Failed v Memory Device Disabled v Configuration Error v Memory Automatically Throttled
v Correctable Memory Error v Parity v Correctable Memory Error Logging
Limit Reached
v Presence Detected v Spare v Critical Over temperature
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
If the sensor name is GPU Func 1 or GPU Func 2, replace GPU 1. If the sensor name is GPU Func 3 or GPU Func 4, replace GPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
No service action is required.
No service action is required.
No service action is required.
Sensor name (Sensor ID):
• CPU Diode 1 (0xC8)
• CPU Diode 2 (0xCB)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.
Sensor name (Sensor ID): Checkstop (0xC9)
Event description: IERR
Service action: If this event immediately precedes a system power off, no service action is required. Otherwise, search for SEL events that meet the following criteria:
• The event has a time stamp in close proximity to the time stamp of this event.
• A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
• Asserted is in the description.
If you found a SEL event that matches the criteria, perform the service action that is indicated in this table for the SEL event. Otherwise, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
• Thermal Trip
• Configuration Error
• Processor Automatically Throttled
• Correctable Machine Check Error
• Processor Presence Detected
Service action: No service action is required.
Event description:
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Sensor name (Sensor ID):
• PSU Fault 1 (0xCD)
• PSU Fault 2 (0xCE)
Event description: Power Supply Failure Detected
Service action: An assert event immediately followed by a deassert event indicates that a power cycle of the system occurred. In that case, no service action is required. If there is no deassert event immediately following the assert event, replace the power supply. If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
• Predictive Failure
• Power Supply Input Out of Range But Present
Service action: If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
• Power Supply Input Lost or AC DC
• Power Supply Input Lost Or Out Of Range
Service action: Ensure that ac power is supplied to the rack. Ensure that the system power cords are plugged tightly into both the power supply and the rack PDU for both system power supplies. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description: Configuration Error
Service action: Ensure that both power supplies are securely seated in the system. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
• Presence Detected
• Power Supply Inactive
Service action: No service action is required.
Sensor name (Sensor ID): CPU VDD Volt (0xCF)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): CPU VDD Curr (0xD0)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): BIOS Golden Side (0xD2)
Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.

Sensor name (Sensor ID): BMC Golden Side (0xD3)
Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.
Sensor name (Sensor ID):
• Fan 1 (0xD4)
• Fan 2 (0xD5)
• Fan 3 (0xD6)
• Fan 4 (0xD7)
Event description:
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: If the sensor name is Fan 1, replace Fan 1. If the sensor name is Fan 2, replace Fan 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
• Device Inserted/Device Present
Service action: No service action is required.
Event description:
• Device Removed/Device Absent
• Transition to degraded
• Install error
• Redundancy lost
• Non-redundant insufficient resources
Service action: Ensure that all fans are seated securely. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): CurPwr Redundant (0xD8)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): NxtPwr Redundant (0xD9)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): Turbo Allowed (0xDA)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.
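The Checkstop (0xC9) row of Table 21 asks you to search the SEL for events that are close in time to the Checkstop event, contain a service action keyword, and have Asserted in the description. The following Python sketch shows one way such a filter might look. It is illustrative only: the SelEvent record, the related_service_events function, and the five-minute window are assumptions, not part of this manual or of any IBM tool. The manual does not quantify "close proximity", and the actual keywords are listed in “Identifying service action keywords in system event logs” on page 36, so the caller must supply them.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SelEvent:
    """Hypothetical, simplified view of one SEL entry."""
    timestamp: datetime
    sensor: str
    description: str


def related_service_events(checkstop, events, keywords,
                           window=timedelta(minutes=5)):
    """Return SEL events that meet the three Checkstop search criteria.

    window is an assumed stand-in for "close proximity"; keywords must
    come from the service action keyword list in the manual.
    """
    matches = []
    for event in events:
        if event is checkstop:
            continue
        # Criterion 1: time stamp close to the Checkstop time stamp.
        near = abs(event.timestamp - checkstop.timestamp) <= window
        # Criterion 2: a service action keyword is present.
        text = event.description.lower()
        has_keyword = any(k.lower() in text for k in keywords)
        # Criterion 3: "Asserted" as a whole word, so "Deasserted" is excluded.
        asserted = re.search(r"\bAsserted\b", event.description) is not None
        if near and has_keyword and asserted:
            matches.append(event)
    return matches
```

If the function returns one or more events, the row directs you to the service action that Table 21 lists for those events. If it returns none, the row directs you to “Collecting diagnostic data” and then to “Contacting IBM service and support”. The sketch does not cover the first check in the row, which is whether the Checkstop event immediately precedes a system power off.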
Identifying a service action by using sensor and event information for the 8335-GTB
You can use the sensor and event information from the system event log (SEL) to determine a service action to perform for the IBM Power System S822LC (8335-GTB).
If you have not done so already, complete “Identifying a service action by using system event logs” on page 27. Then, use the following table to determine the service action to perform.
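Several rows of the table that follows derive the part to replace from the sensor name: OCC n Active and CPU Func n point to CPU n, DIMM Func n to DIMM n, Mem Buf Func n to memory riser n, and CPU Core Func 1 through 12 to CPU 1 while CPU Core Func 13 through 24 point to CPU 2. The Python sketch below encodes only that naming rule. The function name part_for_sensor and the rule list are hypothetical, and the sketch does not decide whether a replacement is needed, because that depends on the event description in the same table row.

```python
import re
from typing import Optional

# (name pattern, lowest index, highest index, part formatter).
# Index ranges follow the sensor lists in Table 22.
_RULES = [
    (re.compile(r"OCC (\d+) Active"), 1, 2, lambda n: f"CPU {n}"),
    (re.compile(r"CPU Func (\d+)"), 1, 2, lambda n: f"CPU {n}"),
    (re.compile(r"DIMM Func (\d+)"), 1, 32, lambda n: f"DIMM {n}"),
    (re.compile(r"CPU Core Func (\d+)"), 1, 24,
     lambda n: "CPU 1" if n <= 12 else "CPU 2"),
    (re.compile(r"Mem Buf Func (\d+)"), 1, 8, lambda n: f"memory riser {n}"),
]


def part_for_sensor(sensor_name: str) -> Optional[str]:
    """Map an 8335-GTB sensor name to the part the table names, or None."""
    name = sensor_name.strip()
    for pattern, low, high, to_part in _RULES:
        match = pattern.fullmatch(name)
        if match:
            index = int(match.group(1))
            # Reject indexes outside the range listed in the table.
            return to_part(index) if low <= index <= high else None
    return None
```

For example, part_for_sensor("CPU Core Func 13") returns "CPU 2", and a name that has no naming rule, such as "Ambient Temp", returns None.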
Table 22. Sensor information, event description, and service action for the 8335-GTB
Sensor name (Sensor ID) Event description Service action
Sensor name (Sensor ID): Watchdog (0x00)
Event description:
• Timer Expired
• Reserved1
• Reserved2
• Reserved3
• Reserved4
Service action: No service action is required.
Event description:
• Hard Reset
• Power Down
• Power Cycle
• Timer Interrupt
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp close to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.

Sensor name (Sensor ID): Host Status (0x04)
Event description: Unknown
Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
• S0/Go “Working”
• S1 “Sleeping with system h/w & processor context maintained”
• S2 “sleeping, processor context lost”
• S3 “sleeping, processor & h/w context lost, memory retained”
• S4 “non-volatile sleep / suspend-to disk”
• S5 / G2: “soft-off”
• S4 / S5: “soft-off”
• G3 mechanical Off
• Sleeping in an S1/S2/S3 State
• G1: Sleeping
• S5: entered by override
• Legacy ON state
• Legacy OFF state
Service action: No service action is required.
Sensor name (Sensor ID): FW Boot Progress (0x05)
Event description:
• System Firmware Error
• System Firmware Hang
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp close to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.
Event description: System Firmware Progress
Service action: No service action is required.

Sensor name (Sensor ID):
• OCC 1 Active (0x08)
• OCC 2 Active (0x09)
Event description: Device Disabled
Service action: If the sensor name is OCC 1 Active, replace CPU 1. If the sensor name is OCC 2 Active, replace CPU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description:
• State Deasserted
• Device Enabled
Service action: No service action is required.

Sensor name (Sensor ID): Ambient Temp (0x0A)
Event description:
• Upper Critical – going low
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Lower Critical – going low
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.
Event description: Upper Critical – going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.
v CPU1 Temp (0x0B) v CPU2 Temp (0x0D)
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical - going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Lower Critical - going low v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
Sensor name (Sensor ID):
• CPU Func 1 (0x0C)
• CPU Func 2 (0x0E)
Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is CPU Func 1, replace CPU 1. If the sensor name is CPU Func 2, replace CPU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description:
• Thermal Trip
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Processor Automatically Throttled
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.
Sensor name (Sensor ID): All PGood (0x1C)
Event description:
• Interlock Power Down
• Power Off Power Down
• Power Cycle
• 240VA Power Down
Service action: No service action is required.
Event description:
• AC Lost
• Soft Power Control Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies.
• Ensure that the system was not powered off.
Event description:
• Power Unit Failure Detected
• Predictive Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU.
• Ensure that the system was not powered off.
• Check for service action required SEL events for the power supply sensor. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57.
v DIMM Func 1 (0x1E) v DIMM Func 2 (0x1F) v DIMM Func 3 (0x20) v DIMM Func 4 (0x21) v DIMM Func 5 (0x22) v DIMM Func 6 (0x23) v DIMM Func 7 (0x24) v DIMM Func 8 (0x25) v DIMM Func 9 (0x26) v DIMM Func 10 (0x27) v DIMM Func 11 (0x28) v DIMM Func 12 (0x29) v DIMM Func 13 (0x2A) v DIMM Func 14 (0x2B) v DIMM Func 15 (0x2C) v DIMM Func 16 (0x2D) v DIMM Func 17 (0x2E) v DIMM Func 18 (0x2F) v DIMM Func 19 (0x30) v DIMM Func 20 (0x31) v DIMM Func 21 (0x32) v DIMM Func 22 (0x33) v DIMM Func 23 (0x34) v DIMM Func 24 (0x35) v DIMM Func 25 (0x36) v DIMM Func 26 (0x37) v DIMM Func 27 (0x38) v DIMM Func 28 (0x39) v DIMM Func 29 (0x3A) v DIMM Func 30 (0x3B) v DIMM Func 31 (0x3C) v DIMM Func 32 (0x3D)
v Memory Device Disabled v Uncorrectable Memory Error v Memory Scrub Failed v State Deasserted v Device Disabled v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Correctable Memory Error v Parity v Correctable Memory Error Logging
Limit Reached
v Memory Automatically Throttled v Critical Over temperature v Presence Detected v Spare v State Asserted v Device Enabled v Transition to OK v Transition to Non-Critical from OK v Transition to Non-Critical from
More Severe
v Monitor v Informational
v Transition to Non-recoverable v Predictive Failure
No service action is required.
If the sensor name is DIMM Func 1,
replace DIMM 1. If the sensor name
is DIMM Func 2, replace DIMM 2.
And so on. Go to “8335-GTB
locations” on page 121 to identify the
physical location and removal and
replacement procedure.
v DIMM Func 1 (0x1E) v DIMM Func 2 (0x1F) v DIMM Func 3 (0x20) v DIMM Func 4 (0x21) v DIMM Func 5 (0x22) v DIMM Func 6 (0x23) v DIMM Func 7 (0x24) v DIMM Func 8 (0x25) v DIMM Func 9 (0x26) v DIMM Func 10 (0x27) v DIMM Func 11 (0x28) v DIMM Func 12 (0x29) v DIMM Func 13 (0x2A) v DIMM Func 14 (0x2B) v DIMM Func 15 (0x2C) v DIMM Func 16 (0x2D) v DIMM Func 17 (0x2E) v DIMM Func 18 (0x2F) v DIMM Func 19 (0x30) v DIMM Func 20 (0x31) v DIMM Func 21 (0x32) v DIMM Func 22 (0x33) v DIMM Func 23 (0x34) v DIMM Func 24 (0x35) v DIMM Func 25 (0x36) v DIMM Func 26 (0x37) v DIMM Func 27 (0x38) v DIMM Func 28 (0x39) v DIMM Func 29 (0x3A) v DIMM Func 30 (0x3B) v DIMM Func 31 (0x3C) v DIMM Func 32 (0x3D)
Configuration Error Complete the following steps:
1. If the sensor name is DIMM Func
1, ensure that DIMM 1 is seated properly. If the sensor name is DIMM Func 2, ensure that DIMM 2 is seated properly. And so on.
2. If you recently installed or
replaced memory DIMMs, ensure that the DIMMs are plugged in the correct memory slots.
3. If the sensor name is DIMM Func
1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
v CPU Core Func 1 (0x3E) v CPU Core Func 2 (0x3F) v CPU Core Func 3 (0x40) v CPU Core Func 4 (0x41) v CPU Core Func 5 (0x42) v CPU Core Func 6 (0x43) v CPU Core Func 7 (0x44) v CPU Core Func 8 (0x45) v CPU Core Func 9 (0x46) v CPU Core Func 10 (0x47) v CPU Core Func 11 (0x48) v CPU Core Func 12 (0x49)
v IERR v Transition to Non-recoverable v Predictive Failure
v FRB1 BIST Failure v FRB2 Hang In POST Failure v FRB3 Processor Startup
Initialization Failure
v Configuration Error v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled v Terminator Presence Detected
Replace system processor CPU 1. Go
to “8335-GTB locations” on page 121
to identify the physical location and
removal and replacement procedure.
No service action is required.
v Machine Check Exception v Correctable Machine Check Error v State Deasserted v Device Disabled v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Thermal Trip v Processor Automatically Throttled v Processor Presence Detected v State Asserted v Device Enabled v Transition to OK v Transition to Non-Critical from OK v Transition to Non-Critical from
More Severe
v Monitor v Informational
v CPU Core Func 13 (0x4A) v CPU Core Func 14 (0x4B) v CPU Core Func 15 (0x4C) v CPU Core Func 16 (0x4D) v CPU Core Func 17 (0x4E) v CPU Core Func 18 (0x4F) v CPU Core Func 19 (0x50) v CPU Core Func 20 (0x51) v CPU Core Func 21 (0x52) v CPU Core Func 22 (0x53) v CPU Core Func 23 (0x54) v CPU Core Func 24 (0x55)
v IERR v Transition to Non-recoverable v Predictive Failure
v FRB1 BIST Failure v FRB2 Hang In POST Failure v FRB3 Processor Startup
Initialization Failure
v Configuration Error v SMBIOS Uncorrectable CPU
Complex Error
v Processor Disabled v Terminator Presence Detected v Machine Check Exception v Correctable Machine Check Error v State Deasserted v Device Disabled v Transition to Critical from Less
Severe
v Transition to Non-recoverable from
Less Severe
v Transition to Critical from
Non-recoverable
v Thermal Trip v Processor Automatically Throttled v Processor Presence Detected v State Asserted v Device Enabled v Transition to OK v Transition to Non-Critical from OK v Transition to Non-Critical from
More Severe
v Monitor v Informational
Replace system processor CPU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
No service action is required.
Sensor name (Sensor ID):
• Mem Buf Func 1 (0x56)
• Mem Buf Func 2 (0x57)
• Mem Buf Func 3 (0x58)
• Mem Buf Func 4 (0x59)
• Mem Buf Func 5 (0x5A)
• Mem Buf Func 6 (0x5B)
• Mem Buf Func 7 (0x5C)
• Mem Buf Func 8 (0x5D)
Event description:
• Uncorrectable Memory Error
• Memory Device Disabled
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Correctable Memory Error
• Parity
• Memory Scrub Failed
• Correctable Memory Error Logging Limit Reached
• Memory Automatically Throttled
• Critical Over temperature
• Presence Detected
• Spare
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.
Event description:
• Configuration Error
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is Mem Buf Func 1, replace memory riser 1. If the sensor name is Mem Buf Func 2, replace memory riser 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): Boot Count (0x5F)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID): Motherboard Flt (0x60)
Event description: State Deasserted
Service action: No service action is required.
Event description: State Asserted
Service action: Replace the system backplane. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Sensor name (Sensor ID): System Event (0x61)
Event description: Undetermined system hardware failure
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
• System Reconfigured
• OEM System boot event
• Entry added to auxiliary log
• PEF Action
• Timestamp Clock Sync
• Transition State Active
• Transition State Idle
• Transition State Busy
Service action: No service action is required.

Sensor name (Sensor ID): Activate Pwr Lt (0x62)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID):
• Ref Clock Fault (0x63)
• PCI Clock Fault (0x64)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.
v DIMM1 Temp (0x69) v DIMM2 Temp (0x6A) v DIMM3 Temp (0x6B) v DIMM4 Temp (0x6C) v DIMM5 Temp (0x6D) v DIMM6 Temp (0x6E) v DIMM7 Temp (0x6F) v DIMM8 Temp (0x70) v DIMM9 Temp (0x71) v DIMM10 Temp (0x72) v DIMM11 Temp (0x73) v DIMM12 Temp (0x74) v DIMM13 Temp (0x75) v DIMM14 Temp (0x76) v DIMM15 Temp (0x77)
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
v DIMM16 Temp (0x78) v DIMM17 Temp (0x79) v DIMM18 Temp (0x7A) v DIMM19 Temp (0x7B) v DIMM20 Temp (0x7C) v DIMM21 Temp (0x7D) v DIMM22 Temp (0x7E) v DIMM23 Temp (0x7F) v DIMM24 Temp (0x80) v DIMM25 Temp (0x81) v DIMM26 Temp (0x82) v DIMM27 Temp (0x83) v DIMM28 Temp (0x84) v DIMM29 Temp (0x85) v DIMM30 Temp (0x86) v DIMM31 Temp (0x87) v DIMM32 Temp (0x88)
v CPU Core Temp 1 (0x89) v CPU Core Temp 2 (0x8A) v CPU Core Temp 3 (0x8B) v CPU Core Temp 4 (0x8C) v CPU Core Temp 5 (0x8D) v CPU Core Temp 6 (0x8E) v CPU Core Temp 7 (0x8F) v CPU Core Temp 8 (0x90) v CPU Core Temp 9 (0x91) v CPU Core Temp 10 (0x92) v CPU Core Temp 11 (0x93) v CPU Core Temp 12 (0x94)
v CPU Core Temp 13 (0x95) v CPU Core Temp 14 (0x96) v CPU Core Temp 15 (0x97) v CPU Core Temp 16 (0x98) v CPU Core Temp 17 (0x99) v CPU Core Temp 18 (0x9A) v CPU Core Temp 19 (0x9B) v CPU Core Temp 20 (0x9C) v CPU Core Temp 21 (0x9D) v CPU Core Temp 22 (0x9E) v CPU Core Temp 23 (0x9F) v CPU Core Temp 24 (0xA0)
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
v Lower Non-critical – going low v Lower Non-critical – going high v Lower Critical – going low v Lower Critical – going high v Lower Non-recoverable – going
low
v Lower Non-recoverable – going
high
v Upper Non-critical – going low v Upper Non-critical – going high v Upper Critical - going low v Upper Critical - going high v Upper Non-recoverable – going
low
v Upper Non-recoverable – going
high
No service action is required.
No service action is required.

Sensor name (Sensor ID):
• System Power (0xA1)
• Proc0 Power (0xA2)
• Proc1 Power (0xA3)
• PCIE Proc0 Pwr (0xA6)
• PCIE Proc1 Power (0xA7)
• GPU Power (0xAA)
• Mem Cache Power (0xAB)
• Mem Proc0 Pwr (0xAC)
• Mem Proc1 Pwr (0xAD)
• Fan Power (0xB0)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• TOD Clock Fault (0xB1)
• APSS Fault (0xB2)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): PS Derating Fac (0xB4)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID): OS Boot (0xB5)
Event description:
• Installation aborted
• Installation failed
Service action: Ensure that the operating system boot image is loaded. Ensure that the disk drive or solid-state drive is ready. Reload the operating system boot image.
Event description:
• A: boot completed
• C: boot completed
• PXE boot completed
• Diagnostic boot completed
• CD-ROM boot completed
• ROM boot completed
• Boot completed - device not specified
• Installation started
• Installation completed
Service action: No service action is required.

Sensor name (Sensor ID): PCI (0xB6)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• GPU Func 1 (0xB8)
• GPU Func 2 (0xB9)
• GPU Func 3 (0xBA)
• GPU Func 4 (0xBB)
Event description:
• Uncorrectable Memory Error
• Parity
• Memory Scrub Failed
• Memory Device Disabled
• Configuration Error
• Memory Automatically Throttled
Service action: If the sensor name is GPU Func 1, replace GPU 1. If the sensor name is GPU Func 2, replace GPU 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description:
• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Presence Detected
• Spare
• Critical Over temperature
Service action: No service action is required.

Sensor name (Sensor ID):
• GPU Temp 1 (0xBC)
• GPU Temp 2 (0xBD)
• GPU Temp 3 (0xBE)
• GPU Temp 4 (0xBF)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• Mem Buf Temp 1 (0xC0)
• Mem Buf Temp 2 (0xC1)
• Mem Buf Temp 3 (0xC2)
• Mem Buf Temp 4 (0xC3)
• Mem Buf Temp 5 (0xC4)
• Mem Buf Temp 6 (0xC5)
• Mem Buf Temp 7 (0xC6)
• Mem Buf Temp 8 (0xC7)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Diode 1 (0xC8)
• CPU Diode 2 (0xCB)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): Checkstop (0xC9)
Event description: IERR
Service action: If this event immediately precedes a system power off, no service action is required. Otherwise, search for SEL events that meet the following criteria:
• The event has a time stamp close to the time stamp of this event.
• A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
• Asserted is in the description.
If you found a SEL event that matches the criteria, perform the service action that is indicated in this table for the SEL event. Otherwise, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
• Thermal Trip
• Configuration Error
• Processor Automatically Throttled
• Correctable Machine Check Error
• Processor Presence Detected
Service action: No service action is required.
Event description:
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
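
The three search criteria in the IERR service action for the Checkstop sensor can be sketched as a filter over SEL events that have already been parsed. This is an illustration only, not an IBM tool: the SelEvent record, the function name, the 5-minute window (the manual says only "close"), and the EXAMPLE_KEYWORD placeholder are all assumptions. Take the real keywords from “Identifying service action keywords in system event logs” on page 36.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SelEvent:
    """Hypothetical parsed SEL entry; field names are assumptions."""
    timestamp: datetime
    sensor: str
    description: str

# Matches the whole word "Asserted" but not "Deasserted".
ASSERTED = re.compile(r"\bAsserted\b", re.IGNORECASE)

def related_service_events(checkstop, events, keywords,
                           window=timedelta(minutes=5)):
    """Return events that meet all three criteria from the IERR service action.

    keywords: service action keywords supplied by the caller (see page 36).
    window:   assumed meaning of a "close" time stamp.
    """
    matches = []
    for ev in events:
        if ev is checkstop:
            continue  # skip the Checkstop event itself
        close_in_time = abs(ev.timestamp - checkstop.timestamp) <= window
        text = f"{ev.sensor} {ev.description}".lower()
        has_keyword = any(k.lower() in text for k in keywords)
        is_asserted = ASSERTED.search(ev.description) is not None
        if close_in_time and has_keyword and is_asserted:
            matches.append(ev)
    return matches
```

If the function returns an empty list, the table directs you to collect diagnostic data and contact IBM service and support.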

Sensor name (Sensor ID):
• PSU Fault 1 (0xCD)
• PSU Fault 2 (0xCE)
Event description: Power Supply Failure Detected
Service action: An assert event immediately followed by a deassert event indicates that a power cycle of the system occurred. No service action is required. If there is no deassert event immediately following the assert event, replace the power supply. If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description:
• Predictive Failure
• Power Supply Input Out of Range But Present
Service action: If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description:
• Power Supply Input Lost or AC DC
• Power Supply Input Lost Or Out Of Range
Service action: Ensure that ac power is supplied to the rack. Ensure that the system power cords are plugged tightly into both the power supply and the rack PDU for both system power supplies. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description: Configuration Error
Service action: Ensure that both power supplies are securely seated in the system. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
Event description:
• Presence Detected
• Power Supply Inactive
Service action: No service action is required.

Sensor name (Sensor ID): CPU VDD Volt (0xCF)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.
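
The Power Supply Failure Detected rule for the PSU Fault sensors (an assert that is immediately followed by a deassert is a power cycle; an assert with no such deassert calls for replacement) can be sketched as follows. The PsuFailureEvent record, the function name, and the 60-second reading of "immediately" are assumptions made for illustration. The input is assumed to contain only Power Supply Failure Detected events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class PsuFailureEvent:
    """Hypothetical parsed 'Power Supply Failure Detected' SEL entry."""
    timestamp: datetime
    sensor: str      # "PSU Fault 1" or "PSU Fault 2"
    asserted: bool   # True for an assert event, False for a deassert event

def psus_needing_replacement(events, window=timedelta(seconds=60)):
    """Return sensor names with an assert that no deassert follows within window."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    flagged = []
    for i, ev in enumerate(ordered):
        if not ev.asserted:
            continue
        # Look for a later deassert on the same sensor inside the assumed window.
        cleared = any(
            later.sensor == ev.sensor
            and not later.asserted
            and later.timestamp - ev.timestamp <= window
            for later in ordered[i + 1:]
        )
        if not cleared and ev.sensor not in flagged:
            flagged.append(ev.sensor)
    return flagged
```

A returned name of PSU Fault 1 corresponds to replacing PSU 1, and PSU Fault 2 to replacing PSU 2, as the table states.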

Sensor name (Sensor ID): CPU VDD Curr (0xD0)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• BIOS Golden Side (0xD2)
• BMC Golden Side (0xD3)
Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.

Sensor name (Sensor ID):
• Fan 1 (0xD4)
• Fan 2 (0xD5)
• Fan 3 (0xD6)
• Fan 4 (0xD7)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
• Device Inserted/Device Present
Service action: No service action is required.
Event description:
• Device Removed/Device Absent
• Transition to degraded
• Install error
• Redundancy lost
• Non-redundant insufficient resources
Service action: Ensure that all fans are seated securely. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): CurPwr Redundant (0xD8)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): NxtPwr Redundant (0xD9)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): Turbo Allowed (0xDA)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• Freq Limit OT 1 (0xDB)
• Freq Limit OT 2 (0xDF)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• Freq Limit Pwr 1 (0xDC)
• Freq Limit Pwr 2 (0xE0)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• Mem Thrtl OT 1 (0xDD)
• Mem Thrtl OT 2 (0xE1)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• Quick Pwr Drop 1 (0xDE)
• Quick Pwr Drop 2 (0xE2)
Event description: State Deasserted
Service action: No service action is required.
Event description: State Asserted
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU.
• Check for SEL events for the power supply sensor that require a service action. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57.

Sensor name (Sensor ID): Water Cooled (0xE3)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU 1 VDD Temp (0xE4)
• CPU 2 VDD Temp (0xE5)
Event description: Upper Critical – going high
Service action: If the system is a water-cooled system, go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26. If the system is an air-cooled system, ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.

Identifying a service action by using sensor and event information for the 8348-21C
You can use the sensor and event information from the system event log to determine a service action to perform for the IBM Power System S812LC (8348-21C).
If you have not done so already, complete “Identifying a service action by using system event logs” on page 27. Then, use the following table to determine the service action to perform.
Table 23. Sensor information, event description, and service action for the 8348-21C
Sensor name (Sensor ID) Event description Service action
Sensor name (Sensor ID): Watchdog (0x00)
Event description:
• Timer Expired
• Reserved1
• Reserved2
• Reserved3
• Reserved4
Service action: No service action is required.
Event description:
• Hard Reset
• Power Down
• Power Cycle
• Timer Interrupt
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.

Sensor name (Sensor ID): Host Status (0x04)
Event description: Unknown
Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
Event description:
• S0/G0 “Working”
• S1 “Sleeping with system h/w & processor context maintained”
• S2 “sleeping, processor context lost”
• S3 “sleeping, processor & h/w context lost, memory retained”
• S4 “non-volatile sleep / suspend-to disk”
• S5 / G2: “soft-off”
• S4 / S5: “soft-off”
• G3 mechanical Off
• Sleeping in an S1/S2/S3 State
• G1: Sleeping
• S5: entered by override
• Legacy ON state
• Legacy OFF state
Service action: No service action is required.
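
The Watchdog service action depends on recognizing boot failure SEL events by the pattern OEM record c0 | 000e000 | 3a150xxxxxxx. A minimal sketch of that check follows. It assumes that each SEL event is available as one line of text with fields separated by the | character and that each x stands for one hexadecimal digit. Neither assumption is stated in this table, and the function name is hypothetical.

```python
import re

# "3a150" followed by seven wildcard digits, taken from the table's
# 3a150xxxxxxx notation; hexadecimal wildcards are an assumption.
BOOT_FAILURE = re.compile(
    r"OEM record c0\s*\|\s*000e000\s*\|\s*3a150[0-9a-f]{7}\b",
    re.IGNORECASE,
)

def is_boot_failure(sel_line: str) -> bool:
    """Return True if the SEL text line matches the boot failure pattern."""
    return BOOT_FAILURE.search(sel_line) is not None

def boot_failures(sel_lines):
    """Return the subset of SEL lines that indicate a failed boot."""
    return [line for line in sel_lines if is_boot_failure(line)]
```

The time stamps of any lines returned still need to be compared with the time stamp of the Watchdog event, as the service action describes.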

Sensor name (Sensor ID): FW Boot Progress (0x05)
Event description:
• System Firmware Error
• System Firmware Hang
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.
Event description: System Firmware Progress
Service action: No service action is required.

Sensor name (Sensor ID): OCC Active (0x08)
Event description: Device Disabled
Service action: Replace the system processor. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
Event description:
• State Deasserted
• Device Enabled
Service action: No service action is required.

Sensor name (Sensor ID): Ambient Temp (0x0A)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.
Event description: Upper Critical – going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.

Sensor name (Sensor ID): CPU Temp (0x64)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID): CPU Func (0x4E)
Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
• Processor Disabled
• Thermal Trip
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Terminator Presence Detected
• Processor Automatically Throttled
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
Service action: Replace the system processor. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
Event description:
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID): All PGood (0x1C)
Event description:
• Interlock Power Down
• Power Off Power Down
• Power Cycle
• 240VA Power Down
Service action: No service action is required.
Event description:
• AC Lost
• Soft Power Control Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies.
• Ensure that the system was not powered off.
Event description:
• Power Unit Failure Detected
• Predictive Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU.
• Ensure that the system was not powered off.
• Check for SEL events for the power supply sensor that require a service action. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8348-21C” on page 78.

Sensor name (Sensor ID):
• DIMM Func 0 (0x1E)
• DIMM Func 1 (0x1F)
• DIMM Func 2 (0x20)
• DIMM Func 3 (0x21)
• DIMM Func 4 (0x22)
• DIMM Func 5 (0x23)
• DIMM Func 6 (0x24)
• DIMM Func 7 (0x25)
• DIMM Func 8 (0x26)
• DIMM Func 9 (0x27)
• DIMM Func 10 (0x28)
• DIMM Func 11 (0x29)
• DIMM Func 12 (0x2A)
• DIMM Func 13 (0x2B)
• DIMM Func 14 (0x2C)
• DIMM Func 15 (0x2D)
• DIMM Func 16 (0x2E)
• DIMM Func 17 (0x2F)
• DIMM Func 18 (0x30)
• DIMM Func 19 (0x31)
• DIMM Func 20 (0x32)
• DIMM Func 21 (0x33)
• DIMM Func 22 (0x34)
• DIMM Func 23 (0x35)
• DIMM Func 24 (0x36)
• DIMM Func 25 (0x37)
• DIMM Func 26 (0x38)
• DIMM Func 27 (0x39)
• DIMM Func 28 (0x3A)
• DIMM Func 29 (0x3B)
• DIMM Func 30 (0x3C)
• DIMM Func 31 (0x3D)
Event description:
• Memory Device Disabled
• Uncorrectable Memory Error
• Memory Scrub Failed
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is DIMM Func 0, replace DIMM 0. If the sensor name is DIMM Func 1, replace DIMM 1. And so on. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
Event description:
• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Memory Automatically Throttled
• Critical Over temperature
• Presence Detected
• Spare
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.