IBM Power System 8335-GCA, Power System S812LC, Power System 8335-GTB, Power System S822LC, Power System 8348-21C User Manual

...

Power Systems

Problem analysis, system parts, and locations for the IBM Power System S822LC (8335-GCA, 8335-GTA, and 8335-GTB), and IBM Power System S812LC (8348-21C)

IBM

Note

Before using this information and the product it supports, read the information in “Safety notices” on page v, “Notices” on page 145, the IBM Systems Safety Notices manual, G229-9054, and the IBM Environmental Notices and User Guide, Z125–5823.

This edition applies to IBM Power Systems™ servers that contain the POWER8® processor and to all associated models.

US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Safety notices ................................. v

Beginning troubleshooting and problem analysis .................. 1

Determining the problem analysis procedure to perform..................... 1

Resolving a BMC access problem ............................ 2

Resolving a power problem .............................. 3

Resolving a system firmware boot failure.......................... 4

Resolving a VGA monitor problem ............................ 8

Resolving an operating system boot failure ......................... 9

Resolving a sensor indicator problem ........................... 11

Resolving a hardware problem ............................. 12

Resolving a GPU, PCIe adapter, or device problem ...................... 13

Resolving a RAID adapter problem .......................... 14

Resolving a network adapter problem ......................... 15

Resolving a graphics processing unit problem ....................... 16

Resolving an NVMe Flash adapter problem ....................... 19

Resolving a storage device problem .......................... 20

Identifying the location of the PCIe adapter by using the slot number ............... 21

Identifying the location of the GPU .......................... 22

Identifying the location of the NVMe Flash adapter ..................... 23

Identifying the location of the storage device ....................... 24

User guides for GPUs and PCIe adapters ........................ 25

Resolving an over temperature problem for a water-cooled 8335-GTB system ............. 26

Identifying a service action .............................. 27

Identifying a service action by using system event logs .................... 27

Identifying service action keywords in system event logs ................... 36

Identifying a service action by using sensor and event information ................ 37

Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA ... 37

Identifying a service action by using sensor and event information for the 8335-GTB ......... 57

Identifying a service action by using sensor and event information for the 8348-21C ......... 78

Isolation procedures ................................ 96

EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure ................. 96

EPUB_PRC_SP_CODE isolation procedure ........................ 97

EPUB_PRC_PHYP_CODE isolation procedure ....................... 97

EPUB_PRC_ALL_PROCS isolation procedure ....................... 98

EPUB_PRC_ALL_MEMCRDS isolation procedure...................... 98

EPUB_PRC_LVL_SUPPORT isolation procedure ...................... 99

EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure ................ 100

EPUB_PRC_FSI_PATH isolation procedure ....................... 100

EPUB_PRC_PROC_AB_BUS isolation procedure...................... 101

EPUB_PRC_PROC_XYZ_BUS isolation procedure ..................... 101

EPUB_PRC_EIBUS_ERROR isolation procedure ...................... 102

EPUB_PRC_POWER_ERROR isolation procedure ..................... 103

EPUB_PRC_MEMORY_UE isolation procedure ...................... 104

EPUB_PRC_HB_CODE isolation procedure ....................... 104

EPUB_PRC_TOD_CLOCK_ERR isolation procedure .................... 106

EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure .................. 106

EPUB_PRC_GPU_ISOLATION_PROCEDURE isolation procedure ................ 107

Verifying a repair ................................. 108

Collecting diagnostic data .............................. 109

Contacting IBM service and support ........................... 110

Finding parts and locations .......................... 111

8335-GCA and 8335-GTA locations ........................... 111

8335-GCA and 8335-GTA parts ............................ 115

Finding parts and locations .......................... 121

8335-GTB locations ................................ 121

8335-GTB parts.................................. 125

Finding parts and locations .......................... 133

8348-21C locations................................. 133

8348-21C parts .................................. 138

Notices ................................... 145

Accessibility features for IBM Power Systems servers ..................... 146

Trademarks ................................... 148

Electronic emission notices .............................. 148

Class A Notices................................. 148

Class B Notices ................................. 152

Terms and conditions................................ 155

iv Problem analysis, system parts, and locations for the 8335-GCA, 8335-GTA, 8335-GTB, and 8348-21C

Safety notices

Safety notices may be printed throughout this guide:

• DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to people.

• CAUTION notices call attention to a situation that is potentially hazardous to people because of some existing condition.

• Attention notices call attention to the possibility of damage to a program, device, system, or data.

World Trade safety information

Several countries require the safety information contained in product publications to be presented in their national languages. If this requirement applies to your country, safety information documentation is included in the publications package (such as in printed documentation, on DVD, or as part of the product) shipped with the product. The documentation contains the safety information in your national language with references to the U.S. English source. Before using a U.S. English publication to install, operate, or service this product, you must first become familiar with the related safety information documentation. You should also refer to the safety information documentation any time you do not clearly understand any safety information in the U.S. English publications.

Replacement or additional copies of safety information documentation can be obtained by calling the IBM Hotline at 1-800-300-8751.

German safety information

Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der Bildschirmarbeitsverordnung geeignet.

Laser safety information

IBM® servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.

Laser compliance

IBM servers may be installed inside or outside of an IT equipment rack.

DANGER: When working on or around the system, observe the following precautions:

Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:

• If IBM supplied the power cord(s), connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.

• Do not open or service any power supply assembly.

• Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this product during an electrical storm.

• The product might be equipped with multiple power cords. To remove all hazardous voltages, disconnect all power cords.

  – For AC power, disconnect all power cords from their AC power source.

  – For racks with a DC power distribution panel (PDP), disconnect the customer’s DC power source to the PDP.

• When connecting power to the product ensure all power cables are properly connected.

  – For racks with AC power, connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system rating plate.

  – For racks with a DC power distribution panel (PDP), connect the customer’s DC power source to the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power return wiring.

• Connect any equipment that will be attached to this product to properly wired outlets.

• When possible, use one hand only to connect or disconnect signal cables.

• Never turn on any equipment when there is evidence of fire, water, or structural damage.

• Do not attempt to switch on power to the machine until all possible unsafe conditions are corrected.

• Assume that an electrical safety hazard is present. Perform all continuity, grounding, and power checks specified during the subsystem installation procedures to ensure that the machine meets safety requirements.

• Do not continue with the inspection if any unsafe conditions are present.

• Before you open the device covers, unless instructed otherwise in the installation and configuration procedures: Disconnect the attached AC power cords, turn off the applicable circuit breakers located in the rack power distribution panel (PDP), and disconnect any telecommunications systems, networks, and modems.

DANGER:

• Connect and disconnect cables as described in the following procedures when installing, moving, or opening covers on this product or attached devices.

To Disconnect:

1. Turn off everything (unless instructed otherwise).

2. For AC power, remove the power cords from the outlets.

3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the PDP and remove the power from the Customer's DC power source.

4. Remove the signal cables from the connectors.

5. Remove all cables from the devices.

To Connect:

1. Turn off everything (unless instructed otherwise).

2. Attach all cables to the devices.

3. Attach the signal cables to the connectors.

4. For AC power, attach the power cords to the outlets.

5. For racks with a DC power distribution panel (PDP), restore the power from the Customer's DC power source and turn on the circuit breakers located in the PDP.

6. Turn on the devices.

Sharp edges, corners and joints may be present in and around the system. Use care when handling equipment to avoid cuts, scrapes and pinching. (D005)

(R001 part 1 of 2):

DANGER: Observe the following precautions when working on or around your IT rack system:

• Heavy equipment–personal injury or equipment damage might result if mishandled.

• Always lower the leveling pads on the rack cabinet.

• Always install stabilizer brackets on the rack cabinet.

• To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest devices in the bottom of the rack cabinet. Always install servers and optional devices starting from the bottom of the rack cabinet.

• Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top of rack-mounted devices. In addition, do not lean on rack mounted devices and do not use them to stabilize your body position (for example, when working from a ladder).

• Each rack cabinet might have more than one power cord.

  – For AC powered racks, be sure to disconnect all power cords in the rack cabinet when directed to disconnect power during servicing.

  – For racks with a DC power distribution panel (PDP), turn off the circuit breaker that controls the power to the system unit(s), or disconnect the customer’s DC power source, when directed to disconnect power during servicing.

• Connect all devices installed in a rack cabinet to power devices installed in the same rack cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power device installed in a different rack cabinet.

• An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock.

(R001 part 2 of 2):

CAUTION:

• Do not install a unit in a rack where the internal rack ambient temperatures will exceed the manufacturer's recommended ambient temperature for all your rack-mounted devices.

• Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not blocked or reduced on any side, front, or back of a unit used for air flow through the unit.

• Consideration should be given to the connection of the equipment to the supply circuit so that overloading of the circuits does not compromise the supply wiring or overcurrent protection. To provide the correct power connection to a rack, refer to the rating labels located on the equipment in the rack to determine the total power requirement of the supply circuit.

• (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets are not attached to the rack. Do not pull out more than one drawer at a time. The rack might become unstable if you pull out more than one drawer at a time.

• (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless specified by the manufacturer. Attempting to move the drawer partially or completely out of the rack might cause the rack to become unstable or cause the drawer to fall out of the rack.

CAUTION: Removing components from the upper positions in the rack cabinet improves rack stability during relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a room or building.

• Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you received it. If this configuration is not known, you must observe the following precautions:

  – Remove all devices in the 32U position (compliance ID RACK-001) or 22U (compliance ID RR001) and above.

  – Ensure that the heaviest devices are installed in the bottom of the rack cabinet.

  – Ensure that there are little-to-no empty U-levels between devices installed in the rack cabinet below the 32U (compliance ID RACK-001) or 22U (compliance ID RR001) level, unless the received configuration specifically allowed it.

• If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from the suite.

• If the rack cabinet you are relocating was supplied with removable outriggers they must be reinstalled before the cabinet is relocated.

• Inspect the route that you plan to take to eliminate potential hazards.

• Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.

• Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).

• Ensure that all devices, shelves, drawers, doors, and cables are secure.

• Ensure that the four leveling pads are raised to their highest position.

• Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.

• Do not use a ramp inclined at more than 10 degrees.

• When the rack cabinet is in the new location, complete the following steps:

  – Lower the four leveling pads.

  – Install stabilizer brackets on the rack cabinet.

  – If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest position to the highest position.

• If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent. Also lower the leveling pads to raise the casters off of the pallet and bolt the rack cabinet to the pallet.

(R002)

DANGER: Hazardous voltage, current, or energy levels are present inside any component that has this label attached. Do not open any cover or barrier that contains this label. (L001)

DANGER: Rack-mounted devices are not to be used as shelves or work spaces. (L002)

DANGER: Multiple power cords. The product might be equipped with multiple AC power cords or multiple DC power cables. To remove all hazardous voltages, disconnect all power cords and power cables. (L003)

CAUTION: A hot surface nearby. (L007)

CAUTION: Hazardous moving parts nearby. (L008)

All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class 1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser product. Consult the label on each part for laser certification numbers and approval information.

CAUTION: This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive, DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:

• Do not remove the covers. Removing the covers of the laser product could result in exposure to hazardous laser radiation. There are no serviceable parts inside the device.

• Use of the controls or adjustments or performance of procedures other than those specified herein might result in hazardous radiation exposure.

(C026)

CAUTION: Data processing environments can contain equipment transmitting on system links with laser modules that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical fiber cable or open receptacle. Although shining light into one end and looking into the other end of a disconnected optical fiber to verify the continuity of optic fibers may not injure the eye, this procedure is potentially dangerous. Therefore, verifying the continuity of optical fibers by shining light into one end and looking at the other end is not recommended. To verify continuity of a fiber optic cable, use an optical light source and power meter. (C027)

CAUTION: This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)

CAUTION: Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following information: laser radiation when open. Do not stare into the beam, do not view directly with optical instruments, and avoid direct exposure to the beam. (C030)

CAUTION: The battery contains lithium. To avoid possible explosion, do not burn or charge the battery.

Do Not:

• ___ Throw or immerse into water

• ___ Heat to more than 100°C (212°F)

• ___ Repair or disassemble

Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local regulations. In the United States, IBM has a process for the collection of this battery. For information, call 1-800-426-4333. Have the IBM part number for the battery unit available when you call. (C003)

CAUTION: Regarding IBM provided VENDOR LIFT TOOL:

• Operation of LIFT TOOL by authorized personnel only.

• LIFT TOOL intended for use to assist, lift, install, remove units (load) up into rack elevations. It is not to be used loaded transporting over major ramps nor as a replacement for such designated tools like pallet jacks, walkies, fork trucks and such related relocation practices. When this is not practicable, specially trained persons or services must be used (for instance, riggers or movers).

• Read and completely understand the contents of LIFT TOOL operator's manual before using. Failure to read, understand, obey safety rules, and follow instructions may result in property damage and/or personal injury. If there are questions, contact the vendor's service and support. Local paper manual must remain with machine in provided storage sleeve area. Latest revision manual available on vendor's web site.

• Test verify stabilizer brake function before each use. Do not over-force moving or rolling the LIFT TOOL with stabilizer brake engaged.

• Do not move LIFT TOOL while platform is raised, except for minor positioning.

• Do not exceed rated load capacity. See LOAD CAPACITY CHART regarding maximum loads at center versus edge of extended platform.

• Only raise load if properly centered on platform. Do not place more than 200 lb (91 kg) on edge of sliding platform shelf also considering the load's center of mass/gravity (CoG).

• Do not corner load the platform tilt riser accessory option. Secure platform riser tilt option to main shelf in all four (4x) locations with provided hardware only, prior to use. Load objects are designed to slide on/off smooth platforms without appreciable force, so take care not to push or lean. Keep riser tilt option flat at all times except for final minor adjustment when needed.

• Do not stand under overhanging load.

• Do not use on uneven surface, incline or decline (major ramps).

• Do not stack loads.

• Do not operate while under the influence of drugs or alcohol.

• Do not support ladder against LIFT TOOL.

• Tipping hazard. Do not push or lean against load with raised platform.

• Do not use as a personnel lifting platform or step. No riders.

• Do not stand on any part of lift. Not a step.

• Do not climb on mast.

• Do not operate a damaged or malfunctioning LIFT TOOL machine.

• Crush and pinch point hazard below platform. Only lower load in areas clear of personnel and obstructions. Keep hands and feet clear during operation.

• No Forks. Never lift or move bare LIFT TOOL MACHINE with pallet truck, jack or fork lift.

• Mast extends higher than platform. Be aware of ceiling height, cable trays, sprinklers, lights, and other overhead objects.

• Do not leave LIFT TOOL machine unattended with an elevated load.

• Watch and keep hands, fingers, and clothing clear when equipment is in motion.

• Turn Winch with hand power only. If winch handle cannot be cranked easily with one hand, it is probably over-loaded. Do not continue to turn winch past top or bottom of platform travel. Excessive unwinding will detach handle and damage cable. Always hold handle when lowering, unwinding. Always assure self that winch is holding load before releasing winch handle.

• A winch accident could cause serious injury. Not for moving humans. Make certain clicking sound is heard as the equipment is being raised. Be sure winch is locked in position before releasing handle. Read instruction page before operating this winch. Never allow winch to unwind freely. Freewheeling will cause uneven cable wrapping around winch drum, damage cable, and may cause serious injury. (C048)

Power and cabling information for NEBS (Network Equipment-Building System) GR-1089-CORE

The following comments apply to the IBM servers that have been designated as conforming to NEBS (Network Equipment-Building System) GR-1089-CORE:

The equipment is suitable for installation in the following:

• Network telecommunications facilities

• Locations where the NEC (National Electrical Code) applies

The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect these interfaces metallically to OSP wiring.

Note: All Ethernet cables must be shielded and grounded at both ends.

The ac-powered system does not require the use of an external surge protection device (SPD).

The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal shall not be connected to the chassis or frame ground.

The dc-powered system is intended to be installed in a common bonding network (CBN) as described in GR-1089-CORE.

Beginning troubleshooting and problem analysis

This information provides a starting point for analyzing problems.

This information is the starting point for diagnosing and repairing systems. From this point, you are guided to the appropriate information to help you diagnose problems, determine the appropriate repair action, and then complete the necessary steps to repair the system.

Note: Update the system firmware to the latest level before you start problem analysis. If you update the system firmware, you will have the latest available fixes and improvements for error handling, reporting, and isolation. For instructions about updating the system firmware, see Getting fixes.

What type of problem are you dealing with? | Problem analysis procedure
You do not know the type of problem. | Go to “Determining the problem analysis procedure to perform.”
A baseboard management controller (BMC) access problem occurred. | Go to “Resolving a BMC access problem” on page 2.
The system does not power on (the power button or the BMC power on command does not power on the system). | Go to “Resolving a power problem” on page 3.
A system firmware boot failure occurred (the system started but was not able to boot to the Petitboot menu). | Go to “Resolving a system firmware boot failure” on page 4.
A video graphics array (VGA) monitor problem occurred (the system started but video is not displayed on the monitor). | Go to “Resolving a VGA monitor problem” on page 8.
An operating system boot failure occurred (the system booted to the Petitboot menu but the operating system did not start). | Go to “Resolving an operating system boot failure” on page 9.
A BMC dashboard sensor is red. | Go to “Resolving a sensor indicator problem” on page 11.
A processor, memory, power, or cooling hardware failure occurred. | Go to “Resolving a hardware problem” on page 12.
Missing or faulty graphics processing unit (GPU), PCIe adapter, disk drive, or solid-state drive. | Go to Resolving a GPU, PCIe adapter, or device problem.

Determining the problem analysis procedure to perform

Learn how to identify the correct problem analysis procedure to perform.

To determine the correct problem analysis procedure to perform, complete the following steps:

1. After you apply power to the system, do the power supply LEDs display XXX and after 30 seconds the power button flashes?

If | Then
Yes | Continue with the next step.
No | Go to “Resolving a power problem” on page 3.

2. Can you access the baseboard management controller (BMC) across the network?

If | Then
Yes | Continue with the next step.
No | Go to “Resolving a BMC access problem.”

3. Can you boot the system to the Petitboot menu?

If | Then
Yes | Continue with the next step.
No | Go to “Resolving a system firmware boot failure” on page 4.

4. Is video displayed on the video graphics array (VGA) monitor?

If | Then
Yes | Continue with the next step.
No | Go to “Resolving a VGA monitor problem” on page 8.

5. Can you start the operating system?

If | Then
Yes | Continue with the next step.
No | Go to “Resolving an operating system boot failure” on page 9.

6. On the BMC dashboard, are any sensors red?

If | Then
Yes | Go to “Resolving a sensor indicator problem” on page 11.
No | Continue with the next step.

7. Go to “Resolving a hardware problem” on page 12. This ends the procedure.
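The ordering of the checks in this procedure can be sketched as a short Python function. This is only an illustration and not part of any IBM tooling: the `Observations` class, its field names, and `choose_procedure` are hypothetical, and the answers to each question must still come from a person observing the system.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Answers to the questions in steps 1-6 (hypothetical field names)."""
    power_leds_ok: bool      # step 1: power supply LEDs and power button behave as expected
    bmc_reachable: bool      # step 2: BMC can be accessed across the network
    petitboot_reached: bool  # step 3: system boots to the Petitboot menu
    vga_video_shown: bool    # step 4: video is displayed on the VGA monitor
    os_started: bool         # step 5: operating system starts
    any_sensor_red: bool     # step 6: at least one BMC dashboard sensor is red


def choose_procedure(obs: Observations) -> str:
    """Return the title of the procedure that the steps lead to."""
    # Steps 1-5: the first question answered "No" selects the procedure.
    checks = [
        (obs.power_leds_ok, "Resolving a power problem"),
        (obs.bmc_reachable, "Resolving a BMC access problem"),
        (obs.petitboot_reached, "Resolving a system firmware boot failure"),
        (obs.vga_video_shown, "Resolving a VGA monitor problem"),
        (obs.os_started, "Resolving an operating system boot failure"),
    ]
    for passed, procedure in checks:
        if not passed:
            return procedure
    # Step 6: a red sensor is handled before the general hardware procedure.
    if obs.any_sensor_red:
        return "Resolving a sensor indicator problem"
    # Step 7: everything else falls through to the hardware procedure.
    return "Resolving a hardware problem"


if __name__ == "__main__":
    print(choose_procedure(Observations(True, True, False, True, True, False)))
```

The checks are evaluated in order, so a system that cannot be reached over the network is routed to the BMC access procedure even if later questions would also be answered "No".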

Resolving a BMC access problem

Learn how to identify the service action that is needed to resolve a baseboard management controller (BMC) access problem.

1. Ensure that the BMC password is not set to the default password. For information about changing the default password, see Logging on to the BMC GUI. Does the problem persist?

If | Then
Yes | Continue with the next step.
No | This ends the procedure.

2. Are both ends of the network cable seated securely?

If | Then
Yes | Continue with the next step.
No | Seat both ends of the cable securely. If the problem persists, continue with the next step.

3. Power off the system and disconnect all ac power cords for 30 seconds. Then, reconnect the ac power cords and power on the system. Does the BMC access problem persist?

If | Then
Yes | Continue with the next step.
No | This ends the procedure.

4. Verify that the BMC network settings are correct.

   a. Power on the system by using the power button on the front of the system. Wait 1 - 2 minutes for the system to display the Petitboot menu.

   b. When the Petitboot menu is displayed, press any key to interrupt the boot process. Then, select Exit to Shell.

   c. Type the following command and press Enter:

   ipmitool lan print 1

   d. Verify that the MAC address and the IP address settings are correct. Then, continue with the next step.

   Note: If the IP address setting is incorrect, go to Configuring the firmware IP address website (http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwenablenetwork.htm). If the MAC address is 00:00:00:00:00:00, go to “Contacting IBM service and support” on page 110.

5. Complete the following actions:

   a. Power on to the Petitboot menu.

   b. Use the BMC to update the system firmware. For instructions, see Updating the system firmware by using the BMC.

Are you able to access the BMC?

If | Then
Yes | This ends the procedure.
No | Continue with the next step.

6. Complete the service action that is indicated for your system:

• If your system is an 8335-GCA or 8335-GTA, replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure. This ends the procedure.

• If your system is an 8335-GTB, replace the BMC card. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure. This ends the procedure.

• If your system is an 8348-21C, replace the system backplane. Go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure. This ends the procedure.
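The check in step 4 can be partly automated once the output of `ipmitool lan print 1` has been captured as text. The following Python sketch assumes the output uses `Field : value` lines with fields named `IP Address` and `MAC Address`; the sample text, the helper names `parse_lan_print` and `next_action`, and the expected IP address are illustrative assumptions, not values taken from a real system.

```python
def parse_lan_print(text: str) -> dict:
    """Parse 'Field : value' lines into a dictionary (assumed output layout)."""
    settings = {}
    for line in text.splitlines():
        # Split on the first colon only, because MAC addresses contain colons.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key:  # skip blank lines and continuation lines without a field name
            settings[key] = value.strip()
    return settings


def next_action(settings: dict, expected_ip: str) -> str:
    """Map the parsed settings to the outcomes described in step 4 and its note."""
    if settings.get("MAC Address") == "00:00:00:00:00:00":
        return "Go to Contacting IBM service and support"
    if settings.get("IP Address") != expected_ip:
        return "Go to Configuring the firmware IP address"
    return "Continue with the next step"


# Illustrative, abridged sample; real output contains more fields.
SAMPLE = """\
IP Address Source       : Static Address
IP Address              : 192.0.2.10
Subnet Mask             : 255.255.255.0
MAC Address             : 00:00:00:00:00:00
"""

if __name__ == "__main__":
    print(next_action(parse_lan_print(SAMPLE), expected_ip="192.0.2.10"))
```

The all-zero MAC address is checked first because the note treats it as a reason to contact service, regardless of the IP address setting. Whether a nonzero MAC address is the correct one for the system still has to be verified by the person performing the procedure.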

Resolving a power problem

Learn how to identify the service action that is needed to resolve a power problem.

1. Is the amber LED of a power supply on solid and is the amber LED on the front of the system turned off?

If Then Yes: Ensure that the power cords for both power supplies are fully seated and that the power

distribution units (PDUs) and power outlets are supplying electricity. This ends the procedure.

No: Continue with the next step.

Beginning troubleshooting and problem analysis 3

Page 20

2. Are the power supply LEDs turned off?

• Yes: Continue with the next step.
• No: Continue with step 4.

3. Perform the following actions, one at a time, until the problem is resolved:
a. Ensure that all of the power cords are fully seated in the power supplies.
b. Ensure that all of the power cords are fully seated in the power distribution units (PDUs) or wall outlets.
c. If the power cords are plugged into PDUs, ensure that the PDUs are turned on.
d. Ensure that all of the power cords are plugged into PDUs or wall outlets that are supplying electricity.
e. Replace the power cords.
f. Replace the power supplies.
  • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

This ends the procedure.

4. Is the amber LED of a power supply on solid and is the red LED on the front of the system flashing at 0.25 Hz?

• Yes: Continue with the next step.
• No: Go to “Contacting IBM service and support” on page 110. This ends the procedure.

5. Perform the following actions, one at a time, until the problem is resolved:
a. Ensure that the power supply is fully seated in the system.
b. Ensure that the power supply fan is not blocked.
c. Replace the power supply.
  • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

This ends the procedure.
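
The branching in the power procedure above can be restated as a short decision function. The following Python sketch is illustrative only: the function name, its four boolean parameters, and the returned strings are hypothetical and simply mirror steps 1 - 5. It does not read LED states from the system; an operator would supply what is observed.

```python
def power_problem_action(psu_amber_solid: bool,
                         front_amber_off: bool,
                         psu_leds_off: bool,
                         front_red_flashing: bool) -> str:
    """Map observed LED states to the step of the power procedure that applies.

    Hypothetical helper: the parameters are operator observations,
    not values read from the hardware.
    """
    # Step 1: power supply amber LED solid and front amber LED off.
    if psu_amber_solid and front_amber_off:
        return "step 1: check power cords, PDUs, and power outlets"
    # Step 2 answered Yes: all power supply LEDs are off, so go to step 3.
    if psu_leds_off:
        return "step 3: reseat cords, check PDUs, replace cords, replace power supplies"
    # Step 4 answered Yes: amber PSU LED solid and front red LED flashing.
    if psu_amber_solid and front_red_flashing:
        return "step 5: reseat power supply, check fan, replace power supply"
    # Step 4 answered No.
    return "contact IBM service and support"


print(power_problem_action(False, False, True, False))
```

The checks are evaluated in the same order as the numbered steps, so a combination that satisfies step 1 never reaches the later branches.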

Resolving a system firmware boot failure

Learn how to identify the service action that is needed to resolve a failure while booting your system firmware.

1. After you pressed the power button, did the system turn on but fail to display the Petitboot menu?

• Yes: Continue with the next step.
• No: Continue with step 5.

2. Does the baseboard management controller (BMC) respond to commands?

Note: To determine whether the BMC responds to commands, run the following ipmitool command:

ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> chassis status

• Yes: Continue with the next step.
• No: Continue with step 4.

3. Complete the following actions:
a. Use the BMC to update the system firmware. For instructions, see Updating the system firmware by using the BMC.
b. Check the system event logs. For instructions, see “Identifying a service action by using system event logs” on page 27. Then, continue with step 5.

4. Complete the following actions, one at a time, until the problem is resolved:
a. Reset the BMC remotely by entering the following command:

ipmitool -I lanplus -U <username> -P <password> -H <bmc ip or bmc hostname> mc reset cold

b. Disconnect the power cords from the system for 30 seconds. Reconnect the power cords, wait 5 minutes, and then go to step 2.
c. Use the IPMI tool to update the system firmware. For instructions, see Updating the system firmware by using the IPMI tool.
d. Complete the service action that is indicated for your system:
  • If your system is an 8335-GCA or 8335-GTA, replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8335-GTB, replace the BMC card. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8348-21C, replace the system backplane. Go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

This ends the procedure.

5. Are you here because of a system event log (SEL) event with the value OEM record c0 and OEM c0 specific log information 3a1503xxxxxx?

• Yes: Continue with step 8 on page 6.
• No: Continue with the next step.

6. Are you here because of a SEL event with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx?

• Yes: Continue with step 12 on page 7.
• No: Continue with the next step.

7. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC power cords and power on the system. Does the system boot successfully?

• Yes: This ends the procedure.
• No: Go to “Resolving a hardware problem” on page 12. This ends the procedure.

8. Did the system complete the boot process successfully?

• Yes: Continue with the next step.
• No: Continue with step 12 on page 7.

9. Determine whether the system is booted from the user-updated level of the system firmware image (primary side) or the manufacturing level of the system firmware image (golden side).

• For in-band networks, enter the following command:

ipmitool sensor list | grep -i golden

• To run the command remotely over the LAN, enter the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sensor list | grep -i golden

Do both of the returned records show 0x0080 in the data fields?

• Yes: The error was temporary. No service action is required. This ends the procedure.
• No: One or both of the returned records have 0x0180 in the data fields. The system was booted from the golden side. Continue with the next step.

10. Search for processor deconfiguration SEL events that have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here. Processor deconfiguration SEL events are displayed in the following form:

• Processor CPU Func x | Transition to Non-recoverable | Asserted

Are processor deconfiguration events present?

• Yes: Complete the service actions for the processor deconfiguration events.
  • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
  • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
  • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
• No: Continue with the next step.

11. Are there other types of SEL events that require a service action and have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here?

• Yes: Complete the service actions for the SEL events that require service actions.
  • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
  • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
  • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
• No: If the boot problem persists, reload or update the system firmware image. Go to Getting fixes and reload the system firmware with the same level of firmware or update the system firmware with a more recent level of firmware. Then, reboot the system. This ends the procedure.

12. Search for processor deconfiguration SEL events that have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here. Processor deconfiguration SEL events are displayed in the following form:

• Processor CPU Func x | Transition to Non-recoverable | Asserted

Are processor deconfiguration events present?

• Yes: Complete the service actions for the processor deconfiguration events.
  • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
  • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
  • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
• No: Continue with the next step.

13. Are there other types of SEL events that require a service action and have a time stamp in close proximity to the time stamp of the event with value OEM record c0 that sent you here?

• Yes: Complete the service actions for the SEL events that require service actions.
  • If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37. This ends the procedure.
  • If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57. This ends the procedure.
  • If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78. This ends the procedure.
• No: Continue with the next step.

14. Power off the system and disconnect all AC power cords for 30 seconds. Then, reconnect the AC power cords and power on the system. Does the system boot successfully?

• Yes: This ends the procedure.
• No: Continue with the next step.

15. Is the system an 8348-21C, and are all 32 of the DIMM locations populated with 32 GB DIMMs?

• Yes: Continue with the next step.
• No: Go to step 18.

16. Use the baseboard management controller (BMC) to update the system firmware. For instructions, see Updating the system firmware by using the BMC. Does the problem persist?

• Yes: Continue with the next step.
• No: This ends the procedure.

17. Is your system an 8335-GTB?

• Yes: Replace the baseboard management controller (BMC) card. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure. If the problem persists, continue with the next step. Otherwise, this ends the procedure.
• No: Continue with the next step.

18. Replace the system backplane.

• If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure. Then, continue with the next step.
• If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure. Then, continue with the next step.
• If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure. Then, continue with the next step.

19. Does the problem persist?

• Yes: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.
• No: This ends the procedure.
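
Steps 2, 4, and 9 of this procedure run ipmitool remotely with the same connection options, and step 9 interprets the returned records by looking for 0x0080 or 0x0180. The following Python sketch shows one way those two pieces might be scripted. The function names are hypothetical, and the sample records use invented sensor names and columns; only the ipmitool options and the two data values come from this procedure.

```python
def remote_ipmitool(username, password, host, *subcommand):
    """Build the argument list for a remote (lanplus) ipmitool call.

    The -I, -U, -P, and -H options are the ones shown in this procedure;
    the function itself is a hypothetical convenience wrapper.
    """
    return ["ipmitool", "-I", "lanplus", "-U", username,
            "-P", password, "-H", host, *subcommand]


def classify_boot_side(golden_records):
    """Classify the records returned by `sensor list | grep -i golden`.

    Returns "golden" if any record contains 0x0180, "primary" if every
    record contains 0x0080, and "unknown" otherwise.
    """
    lines = [line for line in golden_records.splitlines() if line.strip()]
    if any("0x0180" in line for line in lines):
        return "golden"
    if lines and all("0x0080" in line for line in lines):
        return "primary"
    return "unknown"


# Hypothetical sample records; real sensor names and columns may differ.
sample = (
    "Example Golden Sensor A | 0x0 | discrete | 0x0080\n"
    "Example Golden Sensor B | 0x0 | discrete | 0x0180\n"
)
print(classify_boot_side(sample))  # golden
print(remote_ipmitool("user", "secret", "bmc.example.com", "chassis", "status"))
```

The argument list can be handed to a process launcher such as subprocess.run. A result of "unknown" means that the records matched neither pattern and should be read manually.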

Resolving a VGA monitor problem

Learn how to identify the service action that is needed to resolve a video graphics array (VGA) monitor problem.

1. Is the system powered on and is the VGA monitor connected to the VGA display port, but video is not displayed?

• Yes: Continue with the next step.
• No: This ends the procedure.

2. Complete the following steps, one at a time, until the problem is resolved:
a. Ensure that the VGA cable is properly seated to the server port and to the monitor port.
b. Verify that the monitor and the VGA cable are working properly by testing them on a system that is known to be working properly. If the monitor or the VGA cable does not work properly, replace it.
c. Verify that the system is powered on by activating a serial over LAN (SOL) session through the baseboard management controller (BMC). If the system is not active, go to “Resolving a system firmware boot failure” on page 4.
d. Replace the system backplane.
  • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

This ends the procedure.

Resolving an operating system boot failure

Learn how to identify the service action that is needed to resolve a failure while booting your operating system.

1. Was the system recently installed, serviced, moved, or upgraded?

• Yes: Ensure that all cables are properly seated in the connection path to the designated boot device. This ends the procedure.
• No: Continue with the next step.

2. Are you booting the operating system from a network location?

• Yes: Continue with the next step.
• No: Continue with step 4.

3. Complete the following actions, one at a time, until the problem is resolved:
a. Ensure that a problem does not exist with the connection to the network location.
b. Ensure that the adapter has a valid IP address for the network.
c. Replace the network adapter.
  • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

4. Petitboot displays all recognized bootable images to use by default. Is the boot image recognized by Petitboot?

• Yes: Continue with step 11 on page 11.
• No: Select the Petitboot menu option to refresh the boot images. If the problem persists, continue with the next step.

5. Is the system an 8348-21C, and is the boot image on a storage device that is configured in a RAID configuration?

• Yes: Continue with the next step.
• No: Continue with step 11 on page 11.

6. On the Petitboot command line, type the following command:

arcconf getconfig 1 LD

Is the logical boot drive recognized and in optimal status?

• Yes: Reinstall the operating system on the logical drive. This ends the procedure.
• No: Continue with the next step.

7. Are the drives properly seated in their respective drive bays?

Note:

• If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
• If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
• If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

• Yes: Continue with the next step.
• No: Properly seat the drives in the drive bays. Then, go to step 4 on page 9.

8. Refresh the Petitboot boot options. Is the boot image on the logical drive recognized?

• Yes: Boot the operating system. Then, continue with step 11 on page 11.
• No: Continue with the next step.

9. Verify that the physical drives are in the RAID array. On the Petitboot command line, type the following command:

arcconf getconfig 1 PD

Are the physical drives that are known to be in the RAID array recognized?

• Yes: Reinstall the operating system on the logical drive. This ends the procedure.
• No: Continue with the next step.

10. Complete the following actions, one at a time, until the physical drives are recognized in the RAID array:

Note:

• If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
• If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
• If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.

a. Ensure that the SAS cable is securely seated in the RAID adapter and the storage backplane.
b. Replace the RAID adapter.
c. Replace the SAS cable.

This ends the procedure.

11. Does an operating system error occur during the boot?

• Yes: Recover the operating system with the tools provided for the operating system. If that does not resolve the problem, reinstall the operating system. This ends the procedure.
• No: Reinstall the operating system. This ends the procedure.

Resolving a sensor indicator problem

Learn how to resolve a sensor indicator problem by using the BMC dashboard.

After the system is powered on, some sensors retain their status from the last time the system was operational. As a result, the sensor indicator LED might not reflect the status of the physical sensor, and it can be unclear whether the sensor indicator LED indicates an actual problem that requires a service action. For more information about BMC dashboard sensors on an 8335-GCA or 8335-GTA, see Event sensor status GUI display. For more information about BMC dashboard sensors on an 8335-GTB, see Event sensor status GUI display. For more information about BMC dashboard sensors on an 8348-21C, see Event sensor status GUI display.

To refresh the sensor indicator LEDs and to determine whether a service action is required, complete the following procedure:

1. Power off the system. Then, boot the system to the operational state. Click Refresh on the BMC dashboard. Are any of the sensor indicator LEDs still red?

• Yes: Continue with the next step.
• No: This ends the procedure.

2. Record the names of any sensors that have a red LED indicator status.

Note: Repeat steps 3 - 6 for every sensor that you record in this step.

3. Use one of the following commands to list the system event logs (SELs).

• To list SELs by using an in-band network, enter the following command:

ipmitool sel elist

• To list SELs remotely over the LAN, enter the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist

4. Review the list of SELs and locate the log entry that meets the following criteria:

• It contains the name of one of the sensors that you recorded in step 2.
• A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
• Asserted is in the description.

Did you identify a log entry that meets the above criteria?

• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.

5. Use one of the following options to display the SEL details for the sensor:

Note: You must specify the SEL record ID in hexadecimal format. For example: 0x1a.

• To display SEL details by using an in-band network, enter the following command:

ipmitool sel get <SEL record ID>

• To display SEL details remotely over the LAN, enter the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>

6. The sensor ID field contains sensor information in the sensor name (sensor ID) format. Record the sensor name, sensor ID, and event description. Then, use this information to determine the service action to perform:

• If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37 to determine the service action to perform. This ends the procedure.
• If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57 to determine the service action to perform. This ends the procedure.
• If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78 to determine the service action to perform. This ends the procedure.
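
Steps 4 and 5 of this procedure filter the SEL listing by three criteria and then pass a hexadecimal record ID to ipmitool. The following Python sketch illustrates that filtering under stated assumptions: the function names are hypothetical, the sample lines assume a pipe-delimited listing whose first field is the record ID, and the keyword list is a placeholder for the real list in “Identifying service action keywords in system event logs”.

```python
def sel_record_arg(record_id: int) -> str:
    """Format a numeric SEL record ID in the hexadecimal form used in step 5."""
    return hex(record_id)


def find_service_entries(sel_lines, sensor_names, keywords):
    """Return SEL lines that satisfy all three criteria of step 4.

    A line matches when it names a recorded sensor, contains a service
    action keyword, and has a field that is exactly "Asserted".
    """
    matches = []
    for line in sel_lines:
        fields = [field.strip() for field in line.split("|")]
        has_sensor = any(name in line for name in sensor_names)
        has_keyword = any(kw.lower() in line.lower() for kw in keywords)
        if has_sensor and has_keyword and "Asserted" in fields:
            matches.append(line)
    return matches


# Hypothetical listing; record IDs, dates, and the second sensor are invented.
listing = [
    "1a | 01/01/2017 | 00:00:00 | Processor CPU Func 1 | Transition to Non-recoverable | Asserted",
    "1b | 01/01/2017 | 00:00:05 | Processor CPU Func 1 | Transition to Non-recoverable | Deasserted",
    "1c | 01/01/2017 | 00:00:09 | Example Other Sensor | Transition to Non-recoverable | Asserted",
]
# "Non-recoverable" stands in for a keyword from the real keyword list.
for entry in find_service_entries(listing, ["CPU Func 1"], ["Non-recoverable"]):
    print(entry)
print(sel_record_arg(26))  # 0x1a
```

Matching the Asserted field exactly, rather than as a substring, keeps Deasserted events out of the result.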

Resolving a hardware problem

Learn how to identify the service action that is needed to resolve a hardware problem.

1. If you have not already done so, manually boot the system.

2. Go to “Identifying a service action by using system event logs” on page 27. Then, continue with the next step.

3. Was a service action identified?

• Yes: Continue with the next step.
• No: Go to step 5.

4. Did the service action fix the problem?

• Yes: This ends the procedure.
• No: Go to step 5.

5. Go to “Resolving a GPU, PCIe adapter, or device problem” on page 13. Then, continue with the next step.

6. Was a service action identified?

• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.

7. Did the service action fix the problem?

• Yes: This ends the procedure.
• No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110. This ends the procedure.

Resolving a GPU, PCIe adapter, or device problem

Learn how to access log files, information to identify types of events, and a list of potential problems and service actions.

1. Are all of the adapters in the system missing or failed?

• Yes: Replace the system backplane.
  • If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
  • If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
• No: Continue with the next step.

2. To identify the correct service procedure to perform by using operating system log information, complete the following steps:
a. Log in as the root user.
b. At the command prompt, type dmesg and press Enter.

3. Scan the operating system logs for the first occurrence of keywords, such as fail, failure, or failed. When you find a keyword that accompanies one or more of the resource names in the following table, a service action is required. Use the following table to determine the service procedure to perform for your type of problem.

Table 1. Resource names, examples, and service procedures for different types of operating system logs.

Resource name | Example of a log requiring a service action | Type of problem | Service procedure
aacraid | PCI error detected 2 | RAID (Note: This adapter is available only for 8348-21C systems.) | Go to “Resolving a RAID adapter problem” on page 14.
eth1, eth2, eth3 | Failed to re-initialize device | Network | Go to “Resolving a network adapter problem” on page 15.
NVRM | aborting RmInitAdapter failed! | Graphics | Go to “Resolving a graphics processing unit problem” on page 16.
nvidia-nvlink | IBMNPU: NPU FENCE detected, machine power cycle required | Graphics | Go to “Resolving a graphics processing unit problem” on page 16.
nvme | Failed status: ffffffff, reset controller | NVMe Flash adapter (Note: This adapter is available only for 8335-GCA systems.) | Go to “Resolving an NVMe Flash adapter problem” on page 19.
ata1, ata2 | SError: { RecovComm PHYRdyChg 10B8B Dispar } | Marvell storage adapter (Note: This adapter is available only for 8348-21C systems.) | Go to “Resolving a storage device problem” on page 20.
sda, sdb, sdc | FAILED Result | Storage | Go to “Resolving a storage device problem” on page 20.
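
The scan described in step 3 can be approximated with a short script. The following Python sketch is a minimal illustration, not an IBM tool: the function name, the regular expressions, and the sample log lines are assumptions, and only the resource names and procedure titles come from Table 1. It matches only the fail, failure, and failed keywords, so it would miss table examples that use other wording, such as PCI error detected.

```python
import re

# Resource-name patterns (assumed) mapped to the problem type from Table 1.
PROCEDURES = [
    (re.compile(r"\baacraid\b"), "RAID adapter"),
    (re.compile(r"\beth\d+\b"), "network adapter"),
    (re.compile(r"\b(NVRM|nvidia-nvlink)\b"), "graphics processing unit"),
    (re.compile(r"\bnvme"), "NVMe Flash adapter"),
    (re.compile(r"\b(ata\d+|sd[a-z])\b"), "storage device"),
]
# Matches fail, failure, and failed in any letter case.
KEYWORD = re.compile(r"fail", re.IGNORECASE)


def first_service_hit(dmesg_text):
    """Return (problem type, log line) for the first keyword line that names
    a known resource, or None if no such line exists."""
    for line in dmesg_text.splitlines():
        if not KEYWORD.search(line):
            continue
        for pattern, procedure in PROCEDURES:
            if pattern.search(line):
                return procedure, line
    return None


# Hypothetical dmesg excerpt; timestamps and prefixes are invented.
log = (
    "[ 1.0] usb 1-1: new device\n"
    "[ 2.0] eth1: Failed to re-initialize device\n"
    "[ 3.0] nvme nvme0: Failed status: ffffffff, reset controller\n"
)
print(first_service_hit(log))
```

Returning only the first hit follows the instruction to look for the first occurrence of a keyword; later failures are often consequences of the first one.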

Resolving a RAID adapter problem

Learn about the possible problems and service actions that you can perform to resolve a RAID adapter problem.

Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by using the slot number” on page 21.

Table 2. RAID adapter problems and service actions.

Problem: System unable to find adapter
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install it.
5. Restart the system.
6. Replace the adapter.
7. Replace the system backplane.
8. Replace the central processing unit (CPU).

Problem: Adapter stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter is seated properly and all associated cables are connected correctly.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently added one or more new adapters, remove them and then test to determine whether the failing adapter is functioning properly again. If the RAID adapter is functioning again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. Replace the system backplane.
7. Replace the CPU.

Problem: One or more drives are not recognized
Service action:
1. If more than one drive is not recognized, verify that the cables are properly attached to the RAID card.
2. Verify that the drive or drives are fully seated in the system.
3. Verify that all of the cables that attach to the backplane are properly seated.
4. Verify that the drive or drives are compatible with the RAID adapter.
5. Verify that the most recent firmware is installed for the RAID adapter, or install the most recent firmware if it is not already installed.
6. If more than one drive is not recognized, replace the drive.
7. Replace the RAID adapter.
8. Replace the system backplane.
9. Replace the cable or cables.

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For information about adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.

Resolving a network adapter problem

Learn about the possible problems and service actions that you can perform to resolve a network adapter problem.

Note: To determine the location of the PCIe adapter, see “Identifying the location of the PCIe adapter by using the slot number” on page 21.

Table 3. Network adapter problems and service actions.

Problem: System unable to find adapter
Service action:
1. Verify that the adapter is properly seated in a compatible slot.
2. Install the adapter in a different compatible slot.
3. Verify that the drivers for the adapter are installed.
4. Verify that the most recent firmware is installed on the system. If it is not, install it.
5. Restart the system.
6. Replace the adapter.
7. Replace the system backplane.
8. Replace the central processing unit (CPU).

Problem: Adapter stops working suddenly
Service action:
1. If the system was recently installed, moved, serviced, or upgraded, verify that the adapter is seated properly and all associated cables are correctly connected.
2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.
3. Inspect the card and verify that it is not physically damaged.
4. Verify that all cables are properly seated and are not physically damaged. If you recently added one or more new adapters, remove them and then test to determine whether the failing adapter is functioning properly again. If the network adapter is functioning again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters one at a time until all adapters function properly.
5. Replace the adapter.
6. Replace the system backplane.
7. Replace the CPU.

Problem: Link indicator light on the adapter is off
Service action:
1. Verify that the cable functions properly by testing it with a known working connection.
2. Verify that the port or ports on the switch are enabled and functional.
3. Verify that the switch and adapter are compatible.
4. Replace the adapter.

Problem: Link light on the adapter is on, but there is no communication from the adapter
Service action:
1. Verify that the most recent driver is installed, or install the most recent driver if it is not already installed.
2. Verify that the adapter and its link have compatible settings, such as speed and duplex configuration.

Problem: Other problems
Service action: For information about adapter diagnostics, see Supporting diagnostics. For information about adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.

Resolving a graphics processing unit problem

Learn about the possible problems and service actions that you can perform to resolve a graphics processing unit (GPU) problem.

Note: To determine the location of the GPU, see “Identifying the location of the GPU” on page 22.


Table 4. GPU problems and service actions for the 8335-GCA or 8335-GTA

Problem: System unable to find GPU

Service action:

1. Verify that the GPU is properly seated in a compatible slot.

2. Install the GPU in a different compatible slot.

3. Verify that the drivers for the GPU are installed.

4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.

5. Restart the system.

6. If the GPU is still missing, replace the following items, one at a time, until the problem is resolved:

Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

a. GPU
b. System processor modules
c. System backplane

Problem: GPU stops working suddenly

Service action:

1. If the system was recently installed, moved, serviced, or upgraded, verify that the GPU is seated properly and all associated cables are connected correctly.

2. Inspect the PCIe socket and verify that there is no dirt or debris in the socket.

3. Inspect the card and verify that it is not physically damaged.

4. Verify that all cables are properly seated and are not physically damaged. If you recently added one or more new adapters, remove them and then test whether the failing GPU functions properly again. If the GPU functions again, review the IBM support tips to confirm that there are no PCI address, driver, or firmware conflicts. Then, reinstall the new adapters one at a time until all adapters function properly.

5. If the GPU is still not working, replace the following items, one at a time, until the problem is resolved:

Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

a. GPU
b. System processor modules
c. System backplane

Problem: Other problems

Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.


Table 5. GPU problems and service actions for the 8335-GTB

Problem: System unable to find GPU

Service action:

1. Verify that the GPU is properly seated.

2. Verify that the drivers for the GPU are installed.

3. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.

4. Restart the system.

5. If the GPU is still missing, replace the following items, one at a time, until the problem is resolved:

Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.

a. GPU
b. System processor modules
c. System backplane

Problem: Fence errors in the operating system log

Service action:

1. Restart the system. Do fence errors continue to be logged in the operating system log?

• Yes: Continue with the next step.
• No: This ends the procedure.

2. Does NPU chip 0 appear in the fence error log entry?

• Yes: Continue with the next step.
• No: Go to step 4.

3. Replace the following items, one at a time, until the problem is resolved:

Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.

a. CPU 1
b. GPU 2
c. GPU 1
d. System backplane

This ends the procedure.

4. Does NPU chip 1 appear in the fence error log entry?

• Yes: Continue with the next step.
• No: Go to “Contacting IBM service and support” on page 110. This ends the procedure.

5. Replace the following items, one at a time, until the problem is resolved:

Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.

a. CPU 2
b. GPU 4
c. GPU 3
d. System backplane

This ends the procedure.


Problem: GPU stops working suddenly

Service action:

1. If the system was recently installed, moved, serviced, or upgraded, verify that the GPU is seated properly.

2. Inspect the GPU and verify that it is not physically damaged.

3. If the GPU is still not working, replace the following items, one at a time, until the problem is resolved:

Note: Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.

a. GPU
b. System processor modules
c. System backplane

Problem: Other problems

Service action: For information about adapter diagnostics, see Supporting diagnostics. For adapter user information, see “User guides for GPUs and PCIe adapters” on page 25.

Resolving an NVMe Flash adapter problem

Learn about the possible problems and service actions that you can perform to resolve a Non-Volatile Memory Express (NVMe) Flash adapter problem.

If you suspect a problem with a PCIe3 1.92 TB CAPI NVMe Flash accelerator adapter (FC EJ1K; CCIN 58CD), see PCIe3 1.92 TB CAPI NVMe Flash Accelerator Adapter (FC EJ1K; CCIN 58CD).

If you suspect a problem with an NVMe Flash adapter, use the following table to determine the service action to perform.

Note: To determine the location of the NVMe Flash adapter, see “Identifying the location of the NVMe Flash adapter” on page 23.

Table 6. NVMe Flash adapter problems and service actions

Problem: System is unable to find the NVMe Flash adapter

Service action:

1. If the NVMe Flash adapter has an amber LED that is flashing or is on solid, replace the adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.

2. If the system was recently installed, moved, serviced, or upgraded, verify that the NVMe Flash adapter is seated and installed properly.

3. Verify that the NVMe Flash adapter is compatible with the system.

4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.

5. Replace the NVMe Flash adapter. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Important: Before you remove an NVMe Flash adapter, ensure that you back up all data on the adapter or the array that contains the adapter. After you replace the adapter, restore the data.

Problem: NVMe Flash adapter stops working suddenly

Problem: Maximum write capability of an NVMe Flash adapter is depleted

Resolving a storage device problem

Learn about the possible problems and service actions that you can perform to resolve a storage device problem.

Note: To determine the location of the storage device, see “Identifying the location of the storage device” on page 24.


Table 7. Storage device problems and service actions

Problem: System is unable to find a storage device that is at the front of the system

Service action:

1. If the system was recently installed, moved, serviced, or upgraded, verify that the device is seated and installed properly.

2. Verify that the device is compatible with your system.

3. Verify that all internal cables are properly seated and are not physically damaged.

4. Verify that the most recent firmware is installed on the system. If it is not, install the most recent firmware.

5. Replace the drive.

6. If your system is an 8348-21C, replace the system backplane or the storage mezzanine card.

7. Replace the cable.

8. If you have a RAID adapter installed, replace it.

Problem: System is unable to find a storage device that is at the rear of the system (8348-21C only)

Service action:

If the system is unable to find one storage device that is at the rear of the system, replace the following items, one at a time, until the problem is resolved:

• Drive
• Drive tray
• System backplane

If the system is unable to find more than one storage device that is at the rear of the system, replace the following items, one at a time, until the problem is resolved:

• Drive tray
• System backplane

Problem: Drive stops working suddenly

Service action:

1. Verify that all internal cables are properly seated and are not physically damaged.

2. Check the system logs to verify whether the system detected a problem.

3. Replace the drive.

4. If your system is an 8348-21C, replace the system backplane or the storage mezzanine card.

5. Replace the cable.

6. If you have a RAID adapter installed, replace it.

Problem: Other problems

Service action: Check the messages and resolve any other problems that were detected. Then, test the drive again. If the drive still does not function, refer to the documentation for the drive.

Identifying the location of the PCIe adapter by using the slot number

The error message provides information to help you to determine the location of the PCIe adapter.

For example, the log might contain an error message similar to the following text:

[131779.752714] EEH: PHB#0 failure detected, location: Slot5


Use the following table to map the slot number information in the operating system log to the PCIe adapter description and service action.

Table 8. Slot numbers, adapter descriptions, and service action for the 8335-GCA or 8335-GTA

Slot information from the log: PCIe adapter description

Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3
Slot4: PCIe adapter 4
Slot5: PCIe adapter 5

Service action: Replace the PCIe adapter indicated in the PCIe adapter description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

Table 9. Slot numbers, adapter descriptions, and service action for the 8335-GTB

Slot information from the log: PCIe adapter description

Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3

Service action: Replace the PCIe adapter indicated in the PCIe adapter description column. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.

Table 10. Slot numbers, adapter descriptions, and service action for the 8348-21C

Slot information from the log: PCIe adapter description

Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3
Slot4: PCIe adapter 4

Service action: Replace the PCIe adapter indicated in the PCIe adapter description column. Go to “8348-21C locations” on page 133 to identify the physical location and the removal and replacement procedure.
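The slot lookup described in this section can be sketched as a small script. This is only an illustration, not an IBM tool: the function name adapter_from_log_line and the dictionary layout are hypothetical, and the regular expression assumes that the log message has the same shape as the example message shown earlier in this section.

```python
import re

# Slot-to-adapter mappings copied from Tables 8, 9, and 10.
ADAPTERS_BY_MODEL = {
    "8335-GCA": {f"Slot{n}": f"PCIe adapter {n}" for n in range(1, 6)},
    "8335-GTA": {f"Slot{n}": f"PCIe adapter {n}" for n in range(1, 6)},
    "8335-GTB": {f"Slot{n}": f"PCIe adapter {n}" for n in range(1, 4)},
    "8348-21C": {f"Slot{n}": f"PCIe adapter {n}" for n in range(1, 5)},
}

# Assumed message shape: "EEH: PHB#0 failure detected, location: Slot5".
EEH_PATTERN = re.compile(r"EEH: PHB#\w+ failure detected, location: (\S+)")


def adapter_from_log_line(line, model):
    """Return the PCIe adapter description for an EEH log line, or None.

    None is returned when the line is not an EEH failure message, when the
    model is unknown, or when the slot is not listed for that model.
    """
    match = EEH_PATTERN.search(line)
    if match is None:
        return None
    return ADAPTERS_BY_MODEL.get(model, {}).get(match.group(1))


if __name__ == "__main__":
    sample = "[131779.752714] EEH: PHB#0 failure detected, location: Slot5"
    print(adapter_from_log_line(sample, "8335-GCA"))
```

A result of None for a valid EEH message means that the reported location is not a PCIe adapter slot for that model, so the tables in this section do not apply.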

Identifying the location of the GPU

The error message provides information to help you to determine the location of the graphics processing unit (GPU).

On an 8335-GCA or 8335-GTA system, the log might contain an error message similar to the following text:

EEH: PHB#0 failure detected, location: Slot5

On an 8335-GTB system, the log might contain an error message similar to the following text:

EEH: PHB#0 failure detected, location: GPU1

If you have an 8335-GTB system with Red Hat Enterprise Linux 7.4 or later, and if you get an error message with only PCI bus information (for example, 0002:01:00.0), you can determine the GPU slot information by using the lshw command. Complete the following steps:

1. Record the PCI bus information that is in the error message.

2. Log in to the operating system with root authority.

3. Type the following command and press Enter:

lshw -class display

4. Determine the GPU slot that is associated with the PCI bus information that you recorded in step 1.


Use the following table to map the slot or GPU number information in the operating system log to the GPU description and service action. This ends the procedure.

Table 11. Slot numbers, GPU descriptions, and service action for the 8335-GCA or 8335-GTA

Slot number information from the log: GPU description

Slot5: GPU 2
Slot2: GPU 1

Service action: Replace the GPU indicated in the GPU description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

Table 12. GPU numbers, GPU descriptions, and service action for the 8335-GTB

GPU number information from the log: GPU description

GPU1: GPU 1
GPU2: GPU 2
GPU3: GPU 3
GPU4: GPU 4

Service action: Replace the GPU indicated in the GPU description column. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
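When only PCI bus information is available, matching it against the lshw -class display output can be sketched as follows. This is a rough illustration under stated assumptions: the function name find_display_entry and the sample text are hypothetical, the sample shows only two fields, and this section does not say which field of the real output carries the GPU slot, so the sketch simply returns every field of the matching entry for inspection.

```python
# Hypothetical, shortened stand-in for "lshw -class display" output.
SAMPLE_LSHW_OUTPUT = """\
  *-display
       description: 3D controller
       bus info: pci@0002:01:00.0
  *-display
       description: 3D controller
       bus info: pci@0003:01:00.0
"""


def find_display_entry(lshw_text, pci_address):
    """Return the fields of the entry whose bus info matches pci_address.

    Each entry is assumed to start with a line beginning with "*-" and to
    list its fields as "key: value" lines. Returns None when nothing matches.
    """
    entries = []
    current = None
    for raw_line in lshw_text.splitlines():
        line = raw_line.strip()
        if line.startswith("*-"):
            current = {}
            entries.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    for entry in entries:
        if entry.get("bus info", "") == "pci@" + pci_address:
            return entry
    return None


if __name__ == "__main__":
    print(find_display_entry(SAMPLE_LSHW_OUTPUT, "0002:01:00.0"))
```

On a real system, the text passed to find_display_entry would be the captured output of the lshw command rather than the sample string.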

Identifying the location of the NVMe Flash adapter

Use this procedure to identify the location of a Non-Volatile Memory Express (NVMe) Flash adapter.

1. Does the operating system log contain the slot number? For example, the log might contain an error message similar to the following text:

[131779.752714] EEH: PHB#0 failure detected, location: Slot1

• Yes: If your system is an 8335-GCA, use Table 13 on page 24 to map the slot number information in the operating system log to the PCIe adapter description and service action. If your system is an 8335-GTB, use Table 14 on page 24 to map the slot number information in the operating system log to the PCIe adapter description and service action. This ends the procedure.
• No: Continue with the next step.

2. Locate the NVMe Flash adapter by using the PCI address:

a. The operating system log contains information about the NVMe Flash adapter in the form of a PCI address. Record the PCI address information for the NVMe Flash adapter that has failed. For example, in the operating system log message nvme 0006:01:00.0: Failed status: ffffffff, reset controller, the PCI address of the failing NVMe Flash adapter is 0006:01:00.0.

b. At the command line, type lscfg -vl pciaddress, where pciaddress is the NVMe Flash adapter information that you recorded in step 2.a. Then, press Enter.

c. Record the slot number information that is in the location code field.

d. If your system is an 8335-GCA, use Table 13 on page 24 to map the slot number information to the PCIe adapter description and service action. If your system is an 8335-GTB, use Table 14 on page 24 to map the slot number information to the PCIe adapter description and service action. This ends the procedure.


Table 13. Slot numbers, adapter descriptions, and service action for the 8335-GCA

Slot information from the log: PCIe adapter description

Slot1: PCIe adapter 1
Slot3: PCIe adapter 3
Slot4: PCIe adapter 4

Service action: Replace the NVMe Flash adapter indicated in the PCIe adapter description column. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and the removal and replacement procedure.

Table 14. Slot numbers, adapter descriptions, and service action for the 8335-GTB

Slot information from the log: PCIe adapter description

Slot1: PCIe adapter 1
Slot2: PCIe adapter 2
Slot3: PCIe adapter 3

Service action: Replace the NVMe Flash adapter indicated in the PCIe adapter description column. Go to “8335-GTB locations” on page 121 to identify the physical location and the removal and replacement procedure.
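Extracting the PCI address from the operating system log, as described in step 2 of this procedure, can be sketched as follows. The function names nvme_pci_address and lscfg_command are hypothetical, and the pattern assumes that the message has the same shape as the example in step 2.a. The sketch only builds the lscfg argument list from the procedure; it does not run the command.

```python
import re

# Assumed message shape: "nvme 0006:01:00.0: Failed status: ffffffff, ..."
# A PCI address is domain:bus:device.function in hexadecimal.
NVME_PATTERN = re.compile(
    r"nvme ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):"
)


def nvme_pci_address(line):
    """Return the PCI address in an nvme log message, or None if absent."""
    match = NVME_PATTERN.search(line)
    return match.group(1) if match else None


def lscfg_command(pci_address):
    """Build the argument list for the lscfg command from step 2.b."""
    return ["lscfg", "-vl", pci_address]


if __name__ == "__main__":
    sample = "nvme 0006:01:00.0: Failed status: ffffffff, reset controller"
    address = nvme_pci_address(sample)
    print(address)
    print(" ".join(lscfg_command(address)))
```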

Identifying the location of the storage device

Use this procedure to identify the location of a storage device.

1. Is there a disk drive or solid-state drive with an amber fault LED turned on solid?

• Yes: Continue with step 2.
• No: Continue with step 3.

2. Replace the disk drive or solid-state drive.

• If your system is an 8335-GCA or 8335-GTA, go to “8335-GCA and 8335-GTA locations” on page 111 to identify the removal and replacement procedure. This ends the procedure.
• If your system is an 8335-GTB, go to “8335-GTB locations” on page 121 to identify the removal and replacement procedure. This ends the procedure.
• If your system is an 8348-21C, go to “8348-21C locations” on page 133 to identify the removal and replacement procedure. This ends the procedure.

3. Is the system an 8335-GCA, 8335-GTA, or 8335-GTB?

• Yes: Continue with step 4.
• No: Continue with step 5.

4. The storage device location is determined in the drive removal and replacement procedures for your system. Use the following table to find the correct removal and replacement procedure. This ends the procedure.

Table 15. Drive removal and replacement procedures

System: Drive removal and replacement procedures

8335-GCA or 8335-GTA: See Removing and replacing a disk drive in the 8335-GCA or 8335-GTA with the system power turned on.
8335-GTB: See Removing and replacing a disk drive in the 8335-GTB.

5. The system is an 8348-21C. Are the devices controlled by a RAID adapter?

• Yes: Continue with step 6.
• No: Continue with step 9.

6. To locate the device by using the identify LED, complete the following steps:

a. The operating system log contains information about the device in the form sdx, where x is the letter associated with the drive that failed. Record the sdx information for the device that failed. For example, the failing device in the following operating system log is sdb:

[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072

b. At the command prompt, type hdparm -i /dev/sdx, where sdx is the device information recorded in step 6a. Then, press Enter.

c. Record the serial number of the device.

d. At the command prompt, type arcconf getconfig 1 PD and press Enter. Find the reported channel and device numbers for the device that has the same serial number that you recorded in the previous step. Record the reported channel and device numbers.

e. At the command prompt, type arcconf identify 1 device x y start, where x is the reported channel number and y is the reported device number that you recorded in the previous step. Then, press Enter.

Is the identify LED for one of the devices flashing?

• Yes: Continue with the next step.
• No: Continue with step 9.

7. Replace the device with the flashing identify LED. Go to “8348-21C locations” on page 133 to identify the removal and replacement procedure. After you have replaced the device, continue with the next step.

8. At the command prompt, type arcconf identify 1 device x y stop, where x is the reported channel number and y is the reported device number that you recorded in step 6d. Then, press Enter. This ends the procedure.

9. To locate the device by using the device serial number, complete the following steps:

a. The operating system log contains information about the device in the form sdx, where x is the letter associated with the drive that failed. Record the sdx information for the device that failed. For example, the failing device in the following operating system log is sdb:

[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072

b. At the command prompt, type hdparm -i /dev/sdx, where sdx is the device information recorded in step 9a. Then, press Enter.

c. Record the serial number of the device.

d. Power off the system. Remove one device at a time until you identify the device with the serial number identified in step 9c. Replace only the device with the matching serial number. Reinstall the other devices. Go to “8348-21C locations” on page 133 to identify the removal and replacement procedure. This ends the procedure.
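The log-reading and command-building parts of steps 6 through 9 can be sketched as follows. The function names are hypothetical, and the pattern assumes that the log message has the same shape as the blk_update_request example in steps 6a and 9a. The sketch only assembles the hdparm and arcconf argument lists that the procedure names; it does not run them, and the channel and device numbers must still be read from the arcconf getconfig 1 PD output by hand.

```python
import re

# Assumed message shape:
# "[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072"
BLK_PATTERN = re.compile(r"I/O error, dev (sd[a-z]+),")


def failing_device(line):
    """Return the sdx name from an I/O error log message, or None."""
    match = BLK_PATTERN.search(line)
    return match.group(1) if match else None


def hdparm_command(device):
    """Build the hdparm command from steps 6b and 9b."""
    return ["hdparm", "-i", f"/dev/{device}"]


def identify_command(channel, device_number, action):
    """Build the arcconf identify command from steps 6e and 8.

    action must be "start" (turn the identify LED on) or "stop".
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["arcconf", "identify", "1", "device",
            str(channel), str(device_number), action]


if __name__ == "__main__":
    sample = "[ 2614.698832] blk_update_request: I/O error, dev sdb, sector 131072"
    device = failing_device(sample)
    print(" ".join(hdparm_command(device)))
    # Channel 0 and device 1 are placeholder values for illustration only.
    print(" ".join(identify_command(0, 1, "start")))
```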

User guides for GPUs and PCIe adapters

Use this information to find the user guide for your graphics processing unit (GPU) or PCIe adapter.

Use the following table to find the user guide for the GPU or PCIe adapter that you are using.


Table 16. GPU and PCIe adapter user guides

Name: User guide

Broadcom: Broadcom website (http://www.broadcom.com)
Emulex: Emulex website (http://www.emulex.com/products/ethernet-networking-storage-connectivity/ethernetnetworking-adapters/ibm-branded/selection-guide/)
Marvell: Marvell website (http://www.marvell.com/storage/system-solutions/sata-controllers/)
Mellanox: Mellanox Technologies website (http://mymellanox.force.com/support/VF_SerialSearch)
NVIDIA: NVIDIA website (http://www.nvidia.com)
PMC-Sierra: PMC-Sierra website (http://www.nvidia.com)
QLogic: QLogic website (http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/IBM_Search.aspx)

Resolving an over temperature problem for a water-cooled 8335-GTB system

Learn how to identify the service action that is needed to resolve an over temperature problem.

1. Go to Water cooling system specification and requirements. Are all of the requirements for water-cooled systems met?

Note: For information specific to the 8335-GTB, see Model 8335-GTB water cooling option (Feature code E2RD).

• Yes: Continue with the next step.
• No: Work with the customer to ensure that all of the requirements for water-cooled systems are met. This ends the procedure.

2. Is the room temperature less than 40°C (104°F)?

• Yes: Continue with the next step.
• No: Notify the customer. The customer must bring the room temperature within normal range. Continue with the next step.

3. Ensure that the following requirements are met:

a. The quick-connects between the 8335-GTB system and the water manifold are mated and connected to the proper circuits of the manifold. The supply hose must be connected to the supply manifold circuit, which is the manifold circuit that is located toward the inside of the rack. The return hose must be connected to the return manifold circuit, which is the manifold circuit that is located toward the outside of the rack.

b. The facility water supply hose is properly connected to the supply hose on the manifold and the return hose on the manifold is properly connected to the facility water return hose.

• The ball valves that connect the facility water supply hose to the manifold supply hose and the facility water return hose to the manifold return hose are open. For more information about connecting the facility water hoses to the manifold hoses, see Replacing the water manifold in the 8335-GTB.
• All of the valves that might restrict the flow of water through the hoses are open in the facility water system.
• The pumping unit of the facility water system is on and does not have errors.

c. The facility water system is supplying water at the required temperature and flow. For instructions, see Model 8335-GTB water cooling option (Feature code E2RD).

Does the problem persist?

• Yes: Continue with the next step.

Note: Steps 1 - 3 resolve most problems. Ensure that you carefully check steps 1 - 3 before you continue with the next step.

• No: This ends the procedure.

4. Is a processor overheating, but the other processor and the graphics processing units (GPUs) are not overheating?

• Yes: Check the thermal interface material (TIM) between the cold plate and the processor that is overheating. Go to Removing a system processor module from a water-cooled 8335-GTB system and complete the steps to lift the cold plate off the processor. If the TIM pad is damaged, replace the TIM pad. To replace a TIM pad, go to Replacing a system processor module in a water-cooled 8335-GTB system and complete the steps for removing and installing a new TIM pad. This ends the procedure.
• No: Continue with the next step.

5. Is a GPU overheating, but the other GPUs and the processors are not overheating?

• Yes: Replace the thermal interface material (TIM) between the cold plate and the GPU that is overheating. Go to Removing the graphics processing unit from a water-cooled 8335-GTB system and complete the steps to lift the cold plate off the GPU. Then, go to Replacing the graphics processing unit in a water-cooled 8335-GTB system and complete the steps for installing a new TIM pad. If the problem is not resolved, replace the GPU. For instructions about replacing a GPU, see Removing and replacing a graphics processing unit in the 8335-GTB. This ends the procedure.
• No: Continue with the next step.

6. Replace the cold plates. For instructions about how to replace the cold plates, see Removing and replacing the cold plates in the 8335-GTB. Does the problem persist?

• Yes: Go to “Contacting IBM service and support” on page 110. This ends the procedure.
• No: This ends the procedure.

Identifying a service action

Use the following procedures to help you identify the service action that is needed.

Identifying a service action by using system event logs

Use the Intelligent Platform Management Interface (IPMI) program to examine system event logs (SELs) to identify a service action.

1. Use the ipmitool command to examine SELs.

• To list SELs by using an in-band network, use the following command:

ipmitool sel elist

• To list SELs remotely over the LAN, use the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel elist

2. Scan the SELs for an event with the value OEM record de. Did you find a SEL with the value OEM record de?

• Yes: Continue with the next step.
• No: Go to step 4 on page 29.

3. The OEM record de specific log information is indicated by the rightmost digits of the SEL with the value OEM record de. Use Table 17 to determine the service action to perform.

Table 17. OEM record de specific log information and service action

OEM record de specific log information: Service action

00xxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
01xxxxxxxxxx: Go to the “EPUB_PRC_FIND_DECONFIGURE_PART isolation procedure” on page 96.
04xxxxxxxxxx: Go to the “EPUB_PRC_SP_CODE isolation procedure” on page 97.
05xxxxxxxxxx: Go to the “EPUB_PRC_PHYP_CODE isolation procedure” on page 97.
08xxxxxxxxxx: Go to the “EPUB_PRC_ALL_PROCS isolation procedure” on page 98.
09xxxxxxxxxx: Go to the “EPUB_PRC_ALL_MEMCRDS isolation procedure” on page 98.
0Axxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available.
10xxxxxxxxxx: Go to the “EPUB_PRC_LVL_SUPPORT isolation procedure” on page 99.
16xxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available.
1Cxxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available.
22xxxxxxxxxx: Go to the “EPUB_PRC_MEMORY_PLUGGING_ERROR isolation procedure” on page 100.
2Dxxxxxxxxxx: Go to the “EPUB_PRC_FSI_PATH isolation procedure” on page 100.
30xxxxxxxxxx: Go to the “EPUB_PRC_PROC_AB_BUS isolation procedure” on page 101.
31xxxxxxxxxx: Go to the “EPUB_PRC_PROC_XYZ_BUS isolation procedure” on page 101.
34xxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available.
37xxxxxxxxxx: Go to the “EPUB_PRC_EIBUS_ERROR isolation procedure” on page 102.
3Fxxxxxxxxxx: Go to the “EPUB_PRC_POWER_ERROR isolation procedure” on page 103.
4Dxxxxxxxxxx: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available.
4Fxxxxxxxxxx: Go to the “EPUB_PRC_MEMORY_UE isolation procedure” on page 104.
55xxxxxxxxxx: Go to the “EPUB_PRC_HB_CODE isolation procedure” on page 104.
56xxxxxxxxxx: Go to the “EPUB_PRC_TOD_CLOCK_ERR isolation procedure” on page 106.
5Cxxxxxxxxxx: Go to the “EPUB_PRC_COOLING_SYSTEM_ERR isolation procedure” on page 106.
5Exxxxxxxxxx: Go to the “EPUB_PRC_GPU_ISOLATION_PROCEDURE isolation procedure” on page 107.

This ends the procedure.
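The Table 17 lookup in step 3 can be sketched as a script. This is only an illustration under stated assumptions: the function name oem_de_service_action is hypothetical, the sample line is invented, and the sketch assumes that each ipmitool sel elist line is a series of fields separated by "|" with the specific log information as the last field. The prefix-to-procedure mapping is copied from Table 17, with the firmware update rows shortened to "Getting fixes".

```python
# First two characters of the specific log information, from Table 17.
SERVICE_ACTIONS = {
    "00": "Getting fixes (update the system firmware)",
    "01": "EPUB_PRC_FIND_DECONFIGURE_PART",
    "04": "EPUB_PRC_SP_CODE",
    "05": "EPUB_PRC_PHYP_CODE",
    "08": "EPUB_PRC_ALL_PROCS",
    "09": "EPUB_PRC_ALL_MEMCRDS",
    "0A": "Getting fixes (update the system firmware)",
    "10": "EPUB_PRC_LVL_SUPPORT",
    "16": "Getting fixes (update the system firmware)",
    "1C": "Getting fixes (update the system firmware)",
    "22": "EPUB_PRC_MEMORY_PLUGGING_ERROR",
    "2D": "EPUB_PRC_FSI_PATH",
    "30": "EPUB_PRC_PROC_AB_BUS",
    "31": "EPUB_PRC_PROC_XYZ_BUS",
    "34": "Getting fixes (update the system firmware)",
    "37": "EPUB_PRC_EIBUS_ERROR",
    "3F": "EPUB_PRC_POWER_ERROR",
    "4D": "Getting fixes (update the system firmware)",
    "4F": "EPUB_PRC_MEMORY_UE",
    "55": "EPUB_PRC_HB_CODE",
    "56": "EPUB_PRC_TOD_CLOCK_ERR",
    "5C": "EPUB_PRC_COOLING_SYSTEM_ERR",
    "5E": "EPUB_PRC_GPU_ISOLATION_PROCEDURE",
}


def oem_de_service_action(sel_line):
    """Return the Table 17 entry for an "OEM record de" SEL line.

    Returns None when the line is not an OEM record de event or when its
    prefix is not listed in Table 17.
    """
    fields = [field.strip() for field in sel_line.split("|")]
    if not any(field.lower() == "oem record de" for field in fields):
        return None
    prefix = fields[-1][:2].upper()
    return SERVICE_ACTIONS.get(prefix)


if __name__ == "__main__":
    # Invented sample line; the real field layout may differ.
    sample = "  1a | OEM record de | 000000 | 5e0000000000"
    print(oem_de_service_action(sample))
```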

4. Scan the SELs for an event with the value OEM record df. Did you find a SEL with the value OEM record df?

• Yes: Continue with the next step.
• No: Go to step 10 on page 31.

5. One or more events might be logged around the same time as the event with the value OEM record df. These events require a service action if they meet the following criteria:

• A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
• Asserted is in the description.
• OEM record is not in the description.
• The event has a time stamp in close proximity to the time stamp of the event with the value OEM record df.

6. Did you find any SEL events that require a service action as defined in step 5?

• Yes: Continue with the next step.
• No: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

7. Did you find only one SEL event that requires a service action as defined in step 5 on page 29?

• Yes: Continue with the next step.
• No: Go to step 9.

8. Record the SEL record ID for the event you identified in step 5 on page 29. The SEL record ID is indicated by the leftmost digits of the SEL. Use the ipmitool command to display the SEL details.

• To display SEL details by using an in-band network, use the following command:

ipmitool sel get <SEL record ID>

Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.

• To display SEL details remotely over the LAN, use the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>

Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.

The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the sensor name, sensor ID, and event description. Then, use the following information to determine the service action to perform:

• If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37.
• If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57.
• If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78.

This ends the procedure.

9. You identified more than one event in step 5 on page 29. The service actions for all of the events that were identified in step 5 on page 29 must be performed to successfully complete the repair. Record the SEL record IDs for the events that you identified in step 5 on page 29. The SEL record ID is indicated by the leftmost digits of the SEL. Use the ipmitool command to display SEL details for each SEL record ID that you recorded.

• To display SEL details by using an in-band network, use the following command:

ipmitool sel get <SEL record ID>

Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.

• To display SEL details remotely over the LAN, use the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>

Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.

The sensor ID field contains sensor information in the format sensor name (sensor ID). Record the sensor name, sensor ID, and event description. Then, use this information to determine the service action to perform:

• If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37.
• If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57.
• If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78.

This ends the procedure.
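Turning the leftmost digits of a SEL line into the hexadecimal record ID that ipmitool sel get expects, as described in steps 8 and 9, can be sketched as follows. The function names are hypothetical. The sketch assumes that the leftmost field of each sel elist line is the record ID written in hexadecimal without a 0x prefix and is followed by a "|" separator; verify that against the output on your system. It only builds the argument lists shown in steps 8 and 9 and does not run them.

```python
def sel_record_id(sel_line):
    """Return the record ID of a SEL line in 0x-prefixed hexadecimal.

    Raises ValueError when the leftmost field is not a hexadecimal number.
    """
    leftmost = sel_line.split("|", 1)[0].strip()
    return hex(int(leftmost, 16))


def sel_get_command(record_id, host=None, username=None, password=None):
    """Build the "ipmitool sel get" argument list.

    With no host, the in-band form is returned. With a host, the remote
    form over the LAN is returned, using the lanplus interface.
    """
    command = ["ipmitool"]
    if host is not None:
        command += ["-I", "lanplus", "-U", username, "-P", password, "-H", host]
    return command + ["sel", "get", record_id]


if __name__ == "__main__":
    # Invented sample line; only the leftmost field matters here.
    sample = "  1a | example event text"
    record_id = sel_record_id(sample)
    print(" ".join(sel_get_command(record_id)))
```

For a remote query, the host, user name, and password values come from your own BMC configuration, for example sel_get_command(record_id, host="<BMC hostname>", username="<username>", password="<password>").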

10. Scan the SEL for an event with the value OEM record c0.

11. Did you find an event with the value OEM record c0?

• Yes: Continue with the next step.
• No: Go to step 13 on page 35.

12. The OEM record c0 specific log information is indicated by the rightmost digits of the SEL with the value OEM record c0. If your system is an 8335-GCA or 8335-GTA, use Table 18 to determine the service action to perform. If your system is an 8335-GTB, use Table 19 on page 32 to determine the service action to perform. If your system is an 8348-21C, use Table 20 on page 34 to determine the service action to perform.

Table 18. OEM record c0 specific log information, description, and service action for an 8335-GCA or 8335-GTA

OEM record c0 specific log information | Description | Service action
320a01xxxxxx | Phy read failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320a02xxxxxx | Phy speed and duplex failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320exxxxxxxx | OCC reset required | This event is for information only. No service action is required.
3a0400xxxxxx | Chassis soft power off | A user initiated power off request occurred. No service action is required.
3a0402xxxxxx | Chassis soft reboot | A user initiated power off request occurred. No service action is required.
3a0701xxxxxx | Request for PNOR access | This event is for information only. No service action is required.
3a0702xxxxxx | Release of PNOR access | This event is for information only. No service action is required.
3a1100xxxxxx | Fan thread stopped | This event is for information only. No service action is required.
3a1101xxxxxx | Fan thread started | This event is for information only. No service action is required.
3a1503xxxxxx | Primary side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1504xxxxxx | Golden side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx | Fan 1 failure | Replace Fan 1. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a1602xxxxxx | Fan 2 failure | Replace Fan 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a1603xxxxxx | Fan 3 failure | Replace Fan 3. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a1604xxxxxx | Fan 4 failure | Replace Fan 4. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a260xyyyyyy, where x = 1, 2, or 3 | System shut down due to one or more missing or failed fans | The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
3a2604yyyyyy | All of the fans are missing or failed | Ensure that the fan power cable and the disk and fan signal cable are seated properly. If the problem persists, replace the following items, one at a time, until the problem is resolved: power riser with time-of-day battery slot; fan power cable; disk and fan signal cable; disk drive and fan card. Note: Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
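The codes in Table 18 use x and y as placeholders for digits that vary, so a lookup has to match on the fixed leading digits only. The following is a minimal sketch of such a lookup, using a subset of the Table 18 rows. The names `TABLE_18_SUBSET`, `describe`, and `missing_fan_count` are hypothetical, and the sketch assumes that each placeholder stands for one hexadecimal digit.

```python
import re

# Subset of Table 18: (code pattern, description). x and y are placeholders.
TABLE_18_SUBSET = [
    ("320a01xxxxxx", "Phy read failure"),
    ("320exxxxxxxx", "OCC reset required"),
    ("3a1503xxxxxx", "Primary side boot failed"),
    ("3a1601xxxxxx", "Fan 1 failure"),
    ("3a1602xxxxxx", "Fan 2 failure"),
    ("3a260[123]yyyyyy", "System shut down due to one or more missing or failed fans"),
    ("3a2604yyyyyy", "All of the fans are missing or failed"),
]


def pattern_to_regex(pattern):
    """Turn a table pattern into a regex; each x or y matches one hex digit (assumption)."""
    return re.compile("".join("[0-9a-f]" if ch in "xy" else ch for ch in pattern))


def describe(code):
    """Return the description of the first matching row, or None."""
    code = code.strip().lower()
    for pattern, description in TABLE_18_SUBSET:
        if pattern_to_regex(pattern).fullmatch(code):
            return description
    return None


def missing_fan_count(code):
    """For 3a260xyyyyyy with x = 1, 2, or 3, return x (the number of missing or failed fans)."""
    match = re.fullmatch(r"3a260([123])[0-9a-f]{6}", code.strip().lower())
    return int(match.group(1)) if match else None
```

For example, `describe("3a1602000000")` returns the Fan 2 failure description, and `missing_fan_count("3a2602000000")` returns 2. The service action itself still comes from the table.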

Table 19. OEM record c0 specific log information, description, and service action for an 8335-GTB

OEM record c0 specific log information | Description | Service action
320a01xxxxxx | Phy read failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320a02xxxxxx | Phy speed and duplex failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320exxxxxxxx | OCC reset required | This event is for information only. No service action is required.
3a0400xxxxxx | Chassis soft power off | A user initiated power off request occurred. No service action is required.
3a0402xxxxxx | Chassis soft reboot | A user initiated power off request occurred. No service action is required.
3a0701xxxxxx | Request for PNOR access | This event is for information only. No service action is required.
3a0702xxxxxx | Release of PNOR access | This event is for information only. No service action is required.
3a1100xxxxxx | Fan thread stopped | This event is for information only. No service action is required.
3a1101xxxxxx | Fan thread started | This event is for information only. No service action is required.
3a1503xxxxxx | Primary side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1504xxxxxx | Golden side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx | Fan 1 failure | Replace Fan 1. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a1602xxxxxx | Fan 2 failure | Replace Fan 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a1603xxxxxx | Fan 3 failure | Replace Fan 3. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a1604xxxxxx | Fan 4 failure | Replace Fan 4. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a2600xxxxxx | The water-cooled system shut down due to too many processor core sensors reading a temperature at or above the maximum temperature that is allowed. | At least one processor is overheating. Go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26.
3a260xyyyyyy, where x = 1, 2, or 3 | System shut down due to one or more missing or failed fans | The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.
3a2604yyyyyy | All of the fans are missing or failed | Ensure that the fan power cable and the disk and fan signal cable are seated properly. If the problem persists, replace the following items, one at a time, until the problem is resolved: power riser with time-of-day battery slot; fan power cable; disk and fan signal cable; disk drive and fan card. Note: Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Table 20. OEM record c0 specific log information, description, and service action for an 8348-21C

OEM record c0 specific log information | Description | Service action
320a01xxxxxx | Phy read failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320a02xxxxxx | Phy speed and duplex failure | If you are viewing this event from the BMC, the missing or defective cable is now operational and no service action is required. Otherwise, replace the missing or failed LAN cable that attaches the console to the system.
320exxxxxxxx | OCC reset required | This event is for information only. No service action is required.
3a0400xxxxxx | Chassis soft power off | A user initiated power off request occurred. No service action is required.
3a0402xxxxxx | Chassis soft reboot | A user initiated power off request occurred. No service action is required.
3a1504xxxxxx | Golden side boot failed | Go to “Resolving a system firmware boot failure” on page 4.
3a1601xxxxxx | Fan 1 failure | Replace Fan 1. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1602xxxxxx | Fan 2 failure | Replace Fan 2. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1603xxxxxx | Fan 3 failure | Replace Fan 3. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1604xxxxxx | Fan 4 failure | Replace Fan 4. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a1605xxxxxx | Fan 5 failure | Replace Fan 5. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a260xyyyyyy, where x = 1, 2, 3, or 4 | System shut down due to one or more missing or failed fans | The OEM record c0 specific log information is 3a260xyyyyyy, where x is the number of fans that were missing or failed when the system was shut down. The system cannot be powered on with missing or failed fans. If any SEL events were logged with OEM record c0 specific log information 3a16xxxxxxxx, complete the service action indicated in this table. Otherwise, replace the fans, one at a time, until the problem is resolved. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
3a2605yyyyyy | All of the fans are missing or failed | Replace the disk drive backplane. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.
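Step 12 and the tables above send you to a different table and a different locations topic depending on the machine type and model. The following is a minimal sketch that gathers those cross-references in one lookup. The names `ModelReferences`, `REFERENCES`, and `references_for` are hypothetical. The `fan_count` value is an inference from the fan failure codes that each table lists, not a figure that is stated in this section.

```python
from typing import NamedTuple


class ModelReferences(NamedTuple):
    oem_table: str            # table that lists the OEM record c0 codes
    sensor_section_page: int  # page of the sensor and event information topic
    locations_page: int       # page of the locations topic
    fan_count: int            # inferred from the fan failure codes in the table


GCA_GTA = ModelReferences("Table 18", 37, 111, 4)

REFERENCES = {
    "8335-GCA": GCA_GTA,
    "8335-GTA": GCA_GTA,
    "8335-GTB": ModelReferences("Table 19", 57, 121, 4),
    "8348-21C": ModelReferences("Table 20", 78, 133, 5),
}


def references_for(model):
    """Return the cross-references for a machine type and model, or None if unknown."""
    return REFERENCES.get(model.strip().upper())
```

Keeping the page numbers in one structure avoids repeating the model check at each step of the procedure.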

13. One or more SEL events might require a service action. These events require a service action if they meet the following criteria:

v A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
v Asserted is in the description.
v OEM record is not in the description.
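The three criteria in step 13 can be expressed as a single check on the text of one SEL event. The following is a minimal sketch: the function name `requires_service_action` is hypothetical, and the keyword list is supplied by the caller. It assumes that Asserted means the standalone word, so that an event marked Deasserted does not qualify.

```python
import re

# Standalone word "Asserted"; "Deasserted" does not match (assumption about intent).
ASSERTED = re.compile(r"\bAsserted\b")


def requires_service_action(description, keywords):
    """Apply the three criteria from step 13 to the text of one SEL event."""
    if "OEM record" in description:
        return False  # OEM record must not be in the description
    if ASSERTED.search(description) is None:
        return False  # Asserted must be in the description
    lowered = description.lower()
    # At least one service action keyword must be present (substring match).
    return any(keyword.lower() in lowered for keyword in keywords)
```

Because this is a substring match, a short keyword also matches inside a longer event description that contains it.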

14. Did you find one or more SEL events that require a service action as defined in step 13?

If | Then
Yes: | Continue with the next step.
No: | This ends the procedure.

15. The service actions for all of the events that were identified in step 13 must be performed to successfully complete the repair. Record the SEL record IDs for the events that you identified in step 13. The SEL record ID is indicated by the leftmost digits of the SEL. Use the ipmitool command to display SEL details for each SEL record ID that you recorded.

v To display SEL details by using an in-band network, use the following command:

ipmitool sel get <SEL record ID>

Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.

v To display SEL details remotely over the LAN, use the following command:

ipmitool -I lanplus -U <username> -P <password> -H <BMC IP address or BMC hostname> sel get <SEL record ID>

Note: The SEL record ID must be entered in hexadecimal format. For example: 0x1a.

v If your system is an 8335-GCA or 8335-GTA, go to “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37.

v If your system is an 8335-GTB, go to “Identifying a service action by using sensor and event information for the 8335-GTB” on page 57.

v If your system is an 8348-21C, go to “Identifying a service action by using sensor and event information for the 8348-21C” on page 78.

This ends the procedure.

Identifying service action keywords in system event logs

System event logs (SELs) that have Asserted and any of the keywords indicated below in the description require a service action.

Temperature, voltage, and current service action keywords

v Transition to Critical from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable

Fan service action keywords

v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Device Removed / Device Absent
v Transition to degraded
v Install error
v Redundancy lost
v Non-redundant insufficient resources

Memory service action keywords

v Configuration Error
v Transition to Non-recoverable
v Predictive Failure

Processor service action keywords

v IERR
v Transition to Non-recoverable
v Predictive Failure

Power supply and All PGood service action keywords

v Power Supply Failure Detected
v Predictive Failure
v Power Supply Input Lost or AC DC
v Power Supply Input Lost Or Out of Range
v Power Supply Input Out of Range But Present
v Configuration Error
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable
v Redundancy lost
v Non-redundant insufficient resources
v AC Lost
v Soft Power Control Failure
v Power Unit Failure Detected
v Predictive Failure

System firmware service action keywords

v System Firmware Error
v System Firmware Hang
v Transition to Critical from Less Severe
v Transition to Non-recoverable from Less Severe
v Transition to Critical from Non-recoverable
v Transition to Non-recoverable

System ACPI power state service action keywords

v Unknown

Watchdog service action keywords

v Hard Reset
v Power Down
v Power Cycle
v Timer Interrupt

System event service action keywords

v Undetermined system hardware failure

OS boot service action keywords

v Installation aborted
v Installation failed
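Because the same keyword can appear under more than one heading, it can help to keep the keywords grouped by category and look up every category that lists a given event description. The following is a minimal sketch that uses five of the categories above. The names `SERVICE_ACTION_KEYWORDS` and `categories_for` are hypothetical, and the comparison is an exact, case-insensitive match on the event description.

```python
# Subset of the keyword lists above, grouped by category.
SERVICE_ACTION_KEYWORDS = {
    "Fan": [
        "Transition to Critical from Less Severe",
        "Transition to Non-recoverable from Less Severe",
        "Transition to Critical from Non-recoverable",
        "Device Removed / Device Absent",
        "Transition to degraded",
        "Install error",
        "Redundancy lost",
        "Non-redundant insufficient resources",
    ],
    "Memory": ["Configuration Error", "Transition to Non-recoverable", "Predictive Failure"],
    "Processor": ["IERR", "Transition to Non-recoverable", "Predictive Failure"],
    "Watchdog": ["Hard Reset", "Power Down", "Power Cycle", "Timer Interrupt"],
    "OS boot": ["Installation aborted", "Installation failed"],
}


def categories_for(event_description):
    """Return the categories whose keyword list contains this event description."""
    wanted = event_description.strip().lower()
    return [
        category
        for category, keywords in SERVICE_ACTION_KEYWORDS.items()
        if any(wanted == keyword.lower() for keyword in keywords)
    ]
```

An exact match keeps Transition to Non-recoverable from being confused with Transition to Non-recoverable from Less Severe, which is a separate keyword.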

Identifying a service action by using sensor and event information

You can use sensor and event information from the system event log (SEL) to determine a service action.

Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA

You can use the sensor and event information from the system event log (SEL) to determine a service action to perform for the IBM Power® System S822LC (8335-GCA and 8335-GTA).

If you have not done so already, complete “Identifying a service action by using system event logs” on page 27. Then, use the following table to determine the service action to perform.


Table 21. Sensor information, event description, and service action for the 8335-GCA and 8335-GTA

Sensor name (Sensor ID): Watchdog (0x00)
  Event description: Timer Expired; Reserved1; Reserved2; Reserved3; Reserved4
  Service action: No service action is required.
  Event description: Hard Reset; Power Down; Power Cycle; Timer Interrupt
  Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.

Sensor name (Sensor ID): Host Status (0x04)
  Event description: Unknown
  Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
  Event description: S0/G0 “Working”; S1 “Sleeping with system h/w & processor context maintained”; S2 “sleeping, processor context lost”; S3 “sleeping, processor & h/w context lost, memory retained”; S4 “non-volatile sleep / suspend-to-disk”; S5 / G2: “soft-off”; S4 / S5: “soft-off”; G3 mechanical Off; Sleeping in an S1/S2/S3 State; G1: Sleeping; S5: entered by override; Legacy ON state; Legacy OFF state
  Service action: No service action is required.


Sensor name (Sensor ID): FW Boot Progress (0x05)
  Event description: System Firmware Error; System Firmware Hang
  Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.
  Event description: System Firmware Progress
  Service action: No service action is required.

Sensor name (Sensor ID): OCC 1 Active (0x08); OCC 2 Active (0x09)
  Event description: Device Disabled
  Service action: If the sensor name is OCC 1 Active, replace CPU 1. If the sensor name is OCC 2 Active, replace CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
  Event description: State Deasserted; Device Enabled
  Service action: No service action is required.

Sensor name (Sensor ID): Ambient Temp (0x0A)
  Event description: Lower Non-critical - going low; Lower Non-critical - going high; Lower Critical - going low; Lower Critical - going high; Lower Non-recoverable - going low; Lower Non-recoverable - going high; Upper Non-critical - going low; Upper Non-critical - going high; Upper Critical - going low; Upper Non-recoverable - going low; Upper Non-recoverable - going high
  Service action: No service action is required.
  Event description: Upper Critical - going high
  Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.


Sensor name (Sensor ID): CPU1 Temp (0x0B); CPU2 Temp (0x0D)
  Event description: Lower Non-critical - going low; Lower Non-critical - going high; Lower Critical - going low; Lower Critical - going high; Lower Non-recoverable - going low; Lower Non-recoverable - going high; Upper Non-critical - going low; Upper Non-critical - going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable - going low; Upper Non-recoverable - going high
  Service action: No service action is required.


Sensor name (Sensor ID): CPU Func 1 (0x0C); CPU Func 2 (0x0E)
  Event description: IERR; Transition to Non-recoverable; Predictive Failure
  Service action: If the sensor name is CPU Func 1, replace CPU 1. If the sensor name is CPU Func 2, replace CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
  Event description: Thermal Trip; FRB1 BIST Failure; FRB2 Hang In POST Failure; FRB3 Processor Startup Initialization Failure; Configuration Error; SMBIOS Uncorrectable CPU Complex Error; Processor Disabled; Terminator Presence Detected; Processor Automatically Throttled; Machine Check Exception; Correctable Machine Check Error; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Processor Presence Detected; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
  Service action: No service action is required.


Sensor name (Sensor ID): All PGood (0x1C)
  Event description: Interlock Power Down; Power Off Power Down; Power Cycle; 240VA Power Down
  Service action: No service action is required.
  Event description: AC Lost; Soft Power Control Failure
  Service action:
  v Ensure that ac power is supplied to the rack.
  v Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies.
  v Ensure that the system was not powered off.
  Event description: Power Unit Failure Detected; Predictive Failure
  Service action:
  v Ensure that ac power is supplied to the rack.
  v Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU unit.
  v Ensure that the system was not powered off.
  v Check for service action required SEL events for the power supply sensor. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8335-GCA and 8335-GTA” on page 37.


Sensor name (Sensor ID): DIMM Func 1 (0x1E); DIMM Func 2 (0x1F); DIMM Func 3 (0x20); DIMM Func 4 (0x21); DIMM Func 5 (0x22); DIMM Func 6 (0x23); DIMM Func 7 (0x24); DIMM Func 8 (0x25); DIMM Func 9 (0x26); DIMM Func 10 (0x27); DIMM Func 11 (0x28); DIMM Func 12 (0x29); DIMM Func 13 (0x2A); DIMM Func 14 (0x2B); DIMM Func 15 (0x2C); DIMM Func 16 (0x2D); DIMM Func 17 (0x2E); DIMM Func 18 (0x2F); DIMM Func 19 (0x30); DIMM Func 20 (0x31); DIMM Func 21 (0x32); DIMM Func 22 (0x33); DIMM Func 23 (0x34); DIMM Func 24 (0x35); DIMM Func 25 (0x36); DIMM Func 26 (0x37); DIMM Func 27 (0x38); DIMM Func 28 (0x39); DIMM Func 29 (0x3A); DIMM Func 30 (0x3B); DIMM Func 31 (0x3C); DIMM Func 32 (0x3D)
  Event description: Memory Device Disabled; Uncorrectable Memory Error; Memory Scrub Failed; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Correctable Memory Error; Parity; Correctable Memory Error Logging Limit Reached; Memory Automatically Throttled; Critical Over temperature; Presence Detected; Spare; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
  Service action: No service action is required.
  Event description: Transition to Non-recoverable; Predictive Failure
  Service action: If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.


Sensor name (Sensor ID): DIMM Func 1 (0x1E) through DIMM Func 32 (0x3D)
  Event description: Configuration Error
  Service action: Complete the following steps:
  1. If the sensor name is DIMM Func 1, ensure that DIMM 1 is seated properly. If the sensor name is DIMM Func 2, ensure that DIMM 2 is seated properly. And so on.
  2. If you recently installed or replaced memory DIMMs, ensure that the DIMMs are plugged in the correct memory slots.
  3. If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.


Sensor name (Sensor ID): CPU Core Func 1 (0x3E); CPU Core Func 2 (0x3F); CPU Core Func 3 (0x40); CPU Core Func 4 (0x41); CPU Core Func 5 (0x42); CPU Core Func 6 (0x43); CPU Core Func 7 (0x44); CPU Core Func 8 (0x45); CPU Core Func 9 (0x46); CPU Core Func 10 (0x47); CPU Core Func 11 (0x48); CPU Core Func 12 (0x49)
  Event description: IERR; Transition to Non-recoverable; Predictive Failure
  Service action: Replace system processor CPU 1. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
  Event description: FRB1 BIST Failure; FRB2 Hang In POST Failure; FRB3 Processor Startup Initialization Failure; Configuration Error; SMBIOS Uncorrectable CPU Complex Error; Processor Disabled; Terminator Presence Detected; Machine Check Exception; Correctable Machine Check Error; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Thermal Trip; Processor Automatically Throttled; Processor Presence Detected; State Asserted; Device Enabled; Transition to OK; Transition to Non-Critical from OK; Transition to Non-Critical from More Severe; Monitor; Informational
  Service action: No service action is required.


Sensor name (Sensor ID): CPU Core Func 13 (0x4A); CPU Core Func 14 (0x4B); CPU Core Func 15 (0x4C); CPU Core Func 16 (0x4D); CPU Core Func 17 (0x4E); CPU Core Func 18 (0x4F); CPU Core Func 19 (0x50); CPU Core Func 20 (0x51); CPU Core Func 21 (0x52); CPU Core Func 22 (0x53); CPU Core Func 23 (0x54); CPU Core Func 24 (0x55)
  Event description: IERR; Transition to Non-recoverable; Predictive Failure
  Service action: Replace system processor CPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.
  Event description: FRB1 BIST Failure; FRB2 Hang In POST Failure; FRB3 Processor Startup Initialization Failure; Configuration Error; SMBIOS Uncorrectable CPU Complex Error; Processor Disabled; Terminator Presence Detected; Machine Check Exception; Correctable Machine Check Error; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Transition to Non-Critical from More Severe; Monitor; Informational
  Service action: No service action is required.


Sensor name (Sensor ID): Mem Buf Func 1 (0x56); Mem Buf Func 2 (0x57); Mem Buf Func 3 (0x58); Mem Buf Func 4 (0x59); Mem Buf Func 5 (0x5A); Mem Buf Func 6 (0x5B); Mem Buf Func 7 (0x5C); Mem Buf Func 8 (0x5D)
  Event description: Uncorrectable Memory Error; Memory Device Disabled; State Deasserted; Device Disabled; Transition to Critical from Less Severe; Transition to Non-recoverable from Less Severe; Transition to Critical from Non-recoverable; Correctable Memory Error; Parity; Memory Scrub Failed; Correctable Memory Error Logging Limit Reached; Transition to Non-Critical from More Severe; Monitor; Informational
  Service action: No service action is required.
  Event description: Configuration Error; Transition to Non-recoverable; Predictive Failure
  Service action: If the sensor name is Mem Buf Func 1, replace memory riser 1. If the sensor name is Mem Buf Func 2, replace memory riser 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): Boot Count (0x5F)
  Event description: None
  Service action: No service action is required.

Sensor name (Sensor ID): Motherboard Flt (0x60)
  Event description: State Deasserted
  Service action: No service action is required.
  Event description: State Asserted
  Service action: Replace the system backplane. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.


Sensor name (Sensor ID): System Event (0x61)
  Event description: Undetermined system hardware failure
  Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.
  Event description: System Reconfigured; OEM System boot event; Entry added to auxiliary log; PEF Action; Timestamp Clock Sync; Transition State Active; Transition State Idle; Transition State Busy
  Service action: No service action is required.

Sensor name (Sensor ID): Activate Pwr Lt (0x62)
  Event description: None
  Service action: No service action is required.

Sensor name (Sensor ID): Ref Clock Fault (0x63); PCI Clock Fault (0x64)
  Event description: State Deasserted; State Asserted
  Service action: No service action is required.


Sensor name (Sensor ID): DIMM1 Temp (0x69); DIMM2 Temp (0x6A); DIMM3 Temp (0x6B); DIMM4 Temp (0x6C); DIMM5 Temp (0x6D); DIMM6 Temp (0x6E); DIMM7 Temp (0x6F); DIMM8 Temp (0x70); DIMM9 Temp (0x71); DIMM10 Temp (0x72); DIMM11 Temp (0x73); DIMM12 Temp (0x74); DIMM13 Temp (0x75); DIMM14 Temp (0x76); DIMM15 Temp (0x77); DIMM16 Temp (0x78); DIMM17 Temp (0x79); DIMM18 Temp (0x7A); DIMM19 Temp (0x7B); DIMM20 Temp (0x7C); DIMM21 Temp (0x7D); DIMM22 Temp (0x7E); DIMM23 Temp (0x7F); DIMM24 Temp (0x80); DIMM25 Temp (0x81); DIMM26 Temp (0x82); DIMM27 Temp (0x83); DIMM28 Temp (0x84); DIMM29 Temp (0x85); DIMM30 Temp (0x86); DIMM31 Temp (0x87); DIMM32 Temp (0x88)
  Event description: Lower Non-critical - going low; Lower Non-critical - going high; Lower Critical - going low; Lower Critical - going high; Lower Non-recoverable - going low; Lower Non-recoverable - going high; Upper Non-critical - going low; Upper Non-critical - going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable - going low; Upper Non-recoverable - going high
  Service action: No service action is required.


Sensor name (Sensor ID): CPU Core Temp 1 (0x89); CPU Core Temp 2 (0x8A); CPU Core Temp 3 (0x8B); CPU Core Temp 4 (0x8C); CPU Core Temp 5 (0x8D); CPU Core Temp 6 (0x8E); CPU Core Temp 7 (0x8F); CPU Core Temp 8 (0x90); CPU Core Temp 9 (0x91); CPU Core Temp 10 (0x92); CPU Core Temp 11 (0x93); CPU Core Temp 12 (0x94)
  Event description: Lower Non-critical - going low; Lower Non-critical - going high; Lower Critical - going low; Lower Critical - going high; Lower Non-recoverable - going low; Lower Non-recoverable - going high; Upper Non-critical - going low; Upper Non-critical - going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable - going low; Upper Non-recoverable - going high
  Service action: No service action is required.

Sensor name (Sensor ID): CPU Core Temp 13 (0x95); CPU Core Temp 14 (0x96); CPU Core Temp 15 (0x97); CPU Core Temp 16 (0x98); CPU Core Temp 17 (0x99); CPU Core Temp 18 (0x9A); CPU Core Temp 19 (0x9B); CPU Core Temp 20 (0x9C); CPU Core Temp 21 (0x9D); CPU Core Temp 22 (0x9E); CPU Core Temp 23 (0x9F); CPU Core Temp 24 (0xA0)
  Event description: Lower Non-critical - going low; Lower Non-critical - going high; Lower Critical - going low; Lower Critical - going high; Lower Non-recoverable - going low; Lower Non-recoverable - going high; Upper Non-critical - going low; Upper Non-critical - going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable - going low; Upper Non-recoverable - going high
  Service action: No service action is required.


Sensor name (Sensor ID): 12V Sense (0xA1); Proc0 Power (0xA2); Proc1 Power (0xA3); PCIE Proc0 Pwr (0xA6); PCIE Proc1 Pwr (0xA7); GPU Sense (0xAA); Mem Cache Power (0xAB); Mem Proc0 Pwr (0xAC); Mem Proc1 Pwr (0xAD); Fan Power A (0xB0)
  Event description: Lower Non-critical - going low; Lower Non-critical - going high; Lower Critical - going low; Lower Critical - going high; Lower Non-recoverable - going low; Lower Non-recoverable - going high; Upper Non-critical - going low; Upper Non-critical - going high; Upper Critical - going low; Upper Critical - going high; Upper Non-recoverable - going low; Upper Non-recoverable - going high
  Service action: No service action is required.

Sensor name (Sensor ID): TOD Clock Fault (0xB1); APSS Fault (0xB2)
  Event description: State Deasserted; State Asserted
  Service action: No service action is required.

Sensor name (Sensor ID): PS Derating Factor (0xB4)
  Event description: None
  Service action: No service action is required.

Sensor name (Sensor ID): OS Boot (0xB5)
  Event description: Installation aborted; Installation failed
  Service action: Ensure that the operating system boot image is loaded. Ensure that the disk drive or solid-state drive is ready. Reload the operating system boot image.
  Event description: A: boot completed; C: boot completed; PXE boot completed; Diagnostic boot completed; CD-ROM boot completed; ROM boot completed; Boot completed - device not specified; Installation started; Installation completed
  Service action: No service action is required.

Sensor name (Sensor ID): PCI (0xB6)
  Event description: State Deasserted; State Asserted
  Service action: No service action is required.

Sensor name (Sensor ID):
• GPU Func 1 (0xB8)
• GPU Func 2 (0xB9)
• GPU Func 3 (0xBA)
• GPU Func 4 (0xBB)

• GPU Temp 1 (0xBC)
• GPU Temp 2 (0xBD)
• GPU Temp 3 (0xBE)
• GPU Temp 4 (0xBF)

• Mem Buf Temp 1 (0xC0)
• Mem Buf Temp 2 (0xC1)
• Mem Buf Temp 3 (0xC2)
• Mem Buf Temp 4 (0xC3)
• Mem Buf Temp 5 (0xC4)
• Mem Buf Temp 6 (0xC5)
• Mem Buf Temp 7 (0xC6)
• Mem Buf Temp 8 (0xC7)

Event description:
• Uncorrectable Memory Error
• Parity
• Memory Scrub Failed
• Memory Device Disabled
• Configuration Error
• Memory Automatically Throttled

• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Presence Detected
• Spare
• Critical Over temperature

• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action:
If the sensor name is GPU Func 1 or GPU Func 2, replace GPU 1. If the sensor name is GPU Func 3 or GPU Func 4, replace GPU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

No service action is required.

Sensor name (Sensor ID):
• CPU Diode 1 (0xC8)
• CPU Diode 2 (0xCB)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID): Checkstop (0xC9)

Event description: IERR
Service action: If this event immediately precedes a system power off, no service action is required. Otherwise, search for SEL events that meet the following criteria:
• The event has a time stamp in close proximity to the time stamp of this event.
• A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
• Asserted is in the description.
If you found a SEL event that matches the criteria, perform the service action that is indicated in this table for the SEL event. Otherwise, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Event description:
• Thermal Trip
• Configuration Error
• Processor Automatically Throttled
• Correctable Machine Check Error
• Processor Presence Detected
Service action: No service action is required.

Event description:
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Sensor name (Sensor ID):
• PSU Fault 1 (0xCD)
• PSU Fault 2 (0xCE)

Event description: Power Supply Failure Detected
Service action: An assert event immediately followed by a deassert event indicates that a power cycle of the system occurred. No service action is required. If there is no deassert event immediately following the assert event, replace the power supply. If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Event description:
• Predictive Failure
• Power Supply Input Out of Range But Present
Service action: If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Event description:
• Power Supply Input Lost or AC
• Power Supply Input Lost Or Out Of Range
Service action: Ensure that ac power is supplied to the rack. Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Event description: Configuration Error
Service action: Ensure that both power supplies are securely seated in the system. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Event description:
• Presence Detected
• Power Supply Inactive
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU VDD Volt (0xCF)
• CPU VDD Curr (0xD0)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID):
• BIOS Golden Side (0xD2)
• BMC Golden Side (0xD3)

Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.

Sensor name (Sensor ID):
• Fan 1 (0xD4)
• Fan 2 (0xD5)
• Fan 3 (0xD6)
• Fan 4 (0xD7)

Event description:
• Transition to Critical from less Severe
• Transition to Non-recoverable from less severe
• Transition to critical from non-recoverable
Service action: If the sensor name is Fan 1, replace Fan 1. If the sensor name is Fan 2, replace Fan 2. And so on. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
• Device Inserted/Device Present
Service action: No service action is required.

Event description:
• Device Removed/Device Absent
• Transition to degraded
• Install error
• Redundancy lost
• Non-redundant insufficient resources
Service action: Ensure that all fans are seated securely. Go to “8335-GCA and 8335-GTA locations” on page 111 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID):
• CurPwr Redundant (0xD8)
• NxtPwr Redundant (0xD9)
• Turbo Allowed (0xDA)

Event description:
• State Deasserted
• State Asserted

Service action: No service action is required.
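
The PSU Fault 1 and PSU Fault 2 rows of Table 21 separate a power cycle (an assert event immediately followed by a deassert event) from a power supply failure (an assert event with no deassert event after it). The following Python sketch illustrates that check only. The `SelEvent` record and the `psu_failure_action` function are hypothetical names, not part of any IBM or BMC tool, and the sketch assumes that "immediately followed" means the next logged event for the same sensor.

```python
from dataclasses import dataclass


@dataclass
class SelEvent:
    """Hypothetical, simplified view of one SEL entry."""
    sensor: str        # for example "PSU Fault 1"
    description: str   # for example "Power Supply Failure Detected"
    asserted: bool     # True for Asserted, False for Deasserted


def psu_failure_action(events, index):
    """Return the service action for the PSU failure assert at events[index]."""
    event = events[index]
    # Assumption: only the next logged event for the same sensor counts
    # as "immediately following" the assert event.
    for later in events[index + 1:]:
        if later.sensor == event.sensor:
            if later.description == event.description and not later.asserted:
                # Assert followed by deassert: a power cycle occurred.
                return "No service action is required."
            break
    # No matching deassert: replace the supply named by the sensor.
    psu_number = event.sensor.rsplit(" ", 1)[-1]
    return f"Replace PSU {psu_number}."
```

For example, an assert and deassert pair on PSU Fault 1 yields no service action, while a lone assert on PSU Fault 2 yields a replacement of PSU 2. The physical location and the removal and replacement procedure still come from the locations topic that the table references.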

Identifying a service action by using sensor and event information for the 8335-GTB

You can use the sensor and event information from the system event log (SEL) to determine a service action to perform for the IBM Power System S822LC (8335-GTB).

If you have not done so already, complete “Identifying a service action by using system event logs” on page 27. Then, use the following table to determine the service action to perform.
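
The lookup that this procedure describes can be sketched as a mapping from a sensor name and event description to a service action. The Python sketch below is illustrative only: `SERVICE_ACTIONS` and `find_service_action` are hypothetical names, only a few rows of Table 22 are included, and the sensor name and event description are assumed to have already been read from the SEL.

```python
# Partial, illustrative copy of a few Table 22 rows.
# Keys are (sensor name, event description) pairs.
SERVICE_ACTIONS = {
    ("OCC 1 Active", "Device Disabled"): "Replace CPU 1.",
    ("OCC 2 Active", "Device Disabled"): "Replace CPU 2.",
    ("Ambient Temp", "Upper Critical - going high"): (
        "Ensure that the room temperature meets the requirements that are "
        "specified for the system. Ensure that no obstructions are blocking "
        "air flow to the system."
    ),
    ("Motherboard Flt", "State Deasserted"): "No service action is required.",
    ("Motherboard Flt", "State Asserted"): "Replace the system backplane.",
}


def find_service_action(sensor_name, event_description):
    """Return the service action for a SEL event, or None if not listed here."""
    key = (sensor_name.strip(), event_description.strip())
    # None only means the row is missing from this partial sketch;
    # the full table remains the reference.
    return SERVICE_ACTIONS.get(key)
```

A result of `None` does not mean that no service action is required. It means only that the row was not copied into the sketch.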

Table 22. Sensor information, event description, and service action for the 8335-GTB

Sensor name (Sensor ID) Event description Service action

Watchdog (0x00)

Host Status (0x04) Unknown Go to Getting fixes and update the

• Timer Expired
• Reserved1
• Reserved2
• Reserved3
• Reserved4

• Hard Reset
• Power Down
• Power Cycle
• Timer Interrupt

• S0/G0 “Working”
• S1 “Sleeping with system h/w & processor context maintained”
• S2 “sleeping, processor context lost”
• S3 “sleeping, processor & h/w context lost, memory retained”
• S4 “non-volatile sleep / suspend-to-disk”
• S5 / G2: “soft-off”
• S4 / S5: “soft-off”
• G3 mechanical Off
• Sleeping in an S1/S2/S3 State
• G1: Sleeping
• S5: entered by override
• Legacy ON state
• Legacy OFF state

No service action is required.

SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp close to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.

110. No service action is required.

Sensor name (Sensor ID): FW Boot Progress (0x05)

Event description:
• System Firmware Error
• System Firmware Hang
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp close to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4.

Event description: System Firmware Progress
Service action: No service action is required.

Sensor name (Sensor ID):
• OCC 1 Active (0x08)
• OCC 2 Active (0x09)

Event description: Device Disabled
Service action: If the sensor name is OCC 1 Active, replace CPU 1. If the sensor name is OCC 2 Active, replace CPU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• State Deasserted
• Device Enabled
Service action: No service action is required.

Sensor name (Sensor ID): Ambient Temp (0x0A)

Event description:
• Upper Critical – going low
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Lower Critical – going low
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Event description: Upper Critical – going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.

Sensor name (Sensor ID):
• CPU1 Temp (0x0B)
• CPU2 Temp (0x0D)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID):
• CPU Func 1 (0x0C)
• CPU Func 2 (0x0E)

Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is CPU Func 1, replace CPU 1. If the sensor name is CPU Func 2, replace CPU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• Thermal Trip
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID): All PGood (0x1C)

Event description:
• Interlock Power Down
• Power Off Power Down
• Power Cycle
• 240VA Power Down
Service action: No service action is required.

Event description:
• AC Lost
• Soft Power Control Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies.
• Ensure that the system was not powered off.

Event description:
• Power Unit Failure Detected
• Predictive Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU.
• Ensure that the system was not powered off.
• Check for service action required

Event description:
• Memory Device Disabled
• Uncorrectable Memory Error
• Memory Scrub Failed
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Event description:
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description: Configuration Error
Service action: Complete the following steps:
1. If the sensor name is DIMM Func 1, ensure that DIMM 1 is seated properly. If the sensor name is DIMM Func 2, ensure that DIMM 2 is seated properly. And so on.
2. If you recently installed or replaced memory DIMMs, ensure that the DIMMs are plugged in the correct memory slots.
3. If the sensor name is DIMM Func 1, replace DIMM 1. If the sensor name is DIMM Func 2, replace DIMM 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
Service action: Replace system processor CPU 1. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Event description:
• IERR
• Transition to Non-recoverable
• Predictive Failure
Service action: Replace system processor CPU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Sensor name (Sensor ID):
• Mem Buf Func 1 (0x56)
• Mem Buf Func 2 (0x57)
• Mem Buf Func 3 (0x58)
• Mem Buf Func 4 (0x59)
• Mem Buf Func 5 (0x5A)
• Mem Buf Func 6 (0x5B)
• Mem Buf Func 7 (0x5C)
• Mem Buf Func 8 (0x5D)

Event description:
• Uncorrectable Memory Error
• Memory Device Disabled
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Correctable Memory Error
• Parity
• Memory Scrub Failed
• Correctable Memory Error Logging Limit Reached
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.

Event description:
• Configuration Error
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is Mem Buf Func 1, replace memory riser 1. If the sensor name is Mem Buf Func 2, replace memory riser 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): Boot Count (0x5F)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID): Motherboard Flt (0x60)

Event description: State Deasserted
Service action: No service action is required.

Event description: State Asserted
Service action: Replace the system backplane. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): System Event (0x61)

Event description: Undetermined system hardware failure
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Event description:
• System Reconfigured
• OEM System boot event
• Entry added to auxiliary log
• PEF Action
• Timestamp Clock Sync
• Transition State Active
• Transition State Idle
• Transition State Busy
Service action: No service action is required.

Sensor name (Sensor ID): Activate Pwr Lt (0x62)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID):
• Ref Clock Fault (0x63)
• PCI Clock Fault (0x64)

Event description:
• State Deasserted
• State Asserted

Service action: No service action is required.

Sensor name (Sensor ID):
• DIMM1 Temp (0x69)
• DIMM2 Temp (0x6A)
• DIMM3 Temp (0x6B)
• DIMM4 Temp (0x6C)
• DIMM5 Temp (0x6D)
• DIMM6 Temp (0x6E)
• DIMM7 Temp (0x6F)
• DIMM8 Temp (0x70)
• DIMM9 Temp (0x71)
• DIMM10 Temp (0x72)
• DIMM11 Temp (0x73)
• DIMM12 Temp (0x74)
• DIMM13 Temp (0x75)
• DIMM14 Temp (0x76)
• DIMM15 Temp (0x77)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID):
• System Power (0xA1)
• Proc0 Power (0xA2)
• Proc1 Power (0xA3)
• PCIE Proc0 Pwr (0xA6)
• PCIE Proc1 Power (0xA7)
• GPU Power (0xAA)
• Mem Cache Power (0xAB)
• Mem Proc0 Pwr (0xAC)
• Mem Proc1 Pwr (0xAD)
• Fan Power (0xB0)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID):
• TOD Clock Fault (0xB1)
• APSS Fault (0xB2)

Event description:
• State Deasserted
• State Asserted

Service action: No service action is required.

Sensor name (Sensor ID): PS Derating Fac (0xB4)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID): OS Boot (0xB5)

Event description:
• Installation aborted
• Installation failed
Service action: Ensure that the operating system boot image is loaded. Ensure that the disk drive or solid-state drive is ready. Reload the operating system boot image.

Event description:
• A: boot completed
• C: boot completed
• PXE boot completed
• Diagnostic boot completed
• CD-ROM boot completed
• ROM boot completed
• Boot completed - device not specified
• Installation started
• Installation completed
Service action: No service action is required.

Sensor name (Sensor ID): PCI (0xB6)

Event description:
• State Deasserted
• State Asserted

Service action: No service action is required.

Sensor name (Sensor ID):
• GPU Func 1 (0xB8)
• GPU Func 2 (0xB9)
• GPU Func 3 (0xBA)
• GPU Func 4 (0xBB)

• GPU Temp 1 (0xBC)
• GPU Temp 2 (0xBD)
• GPU Temp 3 (0xBE)
• GPU Temp 4 (0xBF)

• Mem Buf Temp 1 (0xC0)
• Mem Buf Temp 2 (0xC1)
• Mem Buf Temp 3 (0xC2)
• Mem Buf Temp 4 (0xC3)
• Mem Buf Temp 5 (0xC4)
• Mem Buf Temp 6 (0xC5)
• Mem Buf Temp 7 (0xC6)
• Mem Buf Temp 8 (0xC7)

Event description:
• Uncorrectable Memory Error
• Parity
• Memory Scrub Failed
• Memory Device Disabled
• Configuration Error
• Memory Automatically Throttled

• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Presence Detected
• Spare
• Critical Over temperature

• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action:
If the sensor name is GPU Func 1, replace GPU 1. If the sensor name is GPU Func 2, replace GPU 2. And so on. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

No service action is required.

Sensor name (Sensor ID):
• CPU Diode 1 (0xC8)
• CPU Diode 2 (0xCB)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID): Checkstop (0xC9)

Event description: IERR
Service action: If this event immediately precedes a system power off, no service action is required. Otherwise, search for SEL events that meet the following criteria:
• The event has a time stamp close to the time stamp of this event.
• A service action keyword is present. For a list of service action keywords, see “Identifying service action keywords in system event logs” on page 36.
• Asserted is in the description.
If you found a SEL event that matches the criteria, perform the service action that is indicated in this table for the SEL event. Otherwise, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Event description:
• Thermal Trip
• Configuration Error
• Processor Automatically Throttled
• Correctable Machine Check Error
• Processor Presence Detected
Service action: No service action is required.

Event description:
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• SMBIOS Uncorrectable CPU Complex Error
• Processor Disabled
• Terminator Presence Detected
• Machine Check Exception
Service action: Go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Sensor name (Sensor ID):
• PSU Fault 1 (0xCD)
• PSU Fault 2 (0xCE)

Event description: Power Supply Failure Detected
Service action: An assert event immediately followed by a deassert event indicates that a power cycle of the system occurred. No service action is required. If there is no deassert event immediately following the assert event, replace the power supply. If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• Predictive Failure
• Power Supply Input Out of Range But Present
Service action: If the sensor name is PSU Fault 1, replace PSU 1. If the sensor name is PSU Fault 2, replace PSU 2. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• Power Supply Input Lost or AC
• Power Supply Input Lost Or Out Of Range
Service action: Ensure that ac power is supplied to the rack. Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description: Configuration Error
Service action: Ensure that both power supplies are securely seated in the system. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Event description:
• Presence Detected
• Power Supply Inactive
Service action: No service action is required.

Sensor name (Sensor ID): CPU VDD Volt (0xCF)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high

Service action: No service action is required.

Sensor name (Sensor ID): CPU VDD Curr (0xD0)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Sensor name (Sensor ID):
• BIOS Golden Side (0xD2)
• BMC Golden Side (0xD3)
Event description: None
Service action: Go to “Resolving a system firmware boot failure” on page 4 and follow the service action for a system event log (SEL) with the value OEM record c0 and OEM c0 specific log information 3a1504xxxxxx.


Sensor name (Sensor ID):
• Fan 1 (0xD4)
• Fan 2 (0xD5)
• Fan 3 (0xD6)
• Fan 4 (0xD7)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
• Device Inserted/Device Present
Service action: No service action is required.

Event description:
• Device Removed/Device Absent
• Transition to degraded
• Install error
• Redundancy lost
• Non-redundant insufficient resources
Service action: Ensure that all fans are seated securely. Go to “8335-GTB locations” on page 121 to identify the physical location and removal and replacement procedure.

Sensor name (Sensor ID): CurPwr Redundant (0xD8)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): NxtPwr Redundant (0xD9)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID): Turbo Allowed (0xDA)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.

Sensor name (Sensor ID):
• Freq Limit OT 1 (0xDB)
• Freq Limit OT 2 (0xDF)
• Freq Limit Pwr 1 (0xDC)
• Freq Limit Pwr 2 (0xE0)
• Mem Thrtl OT 1 (0xDD)
• Mem Thrtl OT 2 (0xE1)
Event description:
• State Deasserted
• State Asserted
Service action: No service action is required.


Sensor name (Sensor ID):
• Quick Pwr Drop 1 (0xDE)
• Quick Pwr Drop 2 (0xE2)

Event description: State Deasserted
Service action: No service action is required.

Event description: State Asserted
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU.
• Check for service action required

Sensor name (Sensor ID): Water Cooled (0xE3)
Event description: None
Service action: No service action is required.

Sensor name (Sensor ID):
• CPU 1 VDD Temp (0xE4)
• CPU 2 VDD Temp (0xE5)
Event description: Upper Critical – going high
Service action: If the system is a water-cooled system, go to “Resolving an over temperature problem for a water-cooled 8335-GTB system” on page 26. If the system is an air-cooled system, ensure that there are no air flow obstructions at the front or at the rear of the system. Ensure that the fans are operating properly.
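
The PSU Fault rows earlier in Table 22 distinguish an assert event that is immediately followed by a deassert event (a power cycle, no service action) from an assert event with no following deassert event (replace the power supply whose number matches the sensor name). The following Python sketch illustrates that rule only. It assumes the SEL has already been parsed into (sensor name, direction) tuples in log order; that tuple layout and the function name are hypothetical and are not part of any IBM tool.

```python
import re


def psu_fault_actions(events):
    """Suggest a service action for each PSU Fault assert event.

    events: list of (sensor_name, direction) tuples in log order, where
    direction is "assert" or "deassert" (hypothetical, pre-parsed layout).
    """
    actions = []
    for index, (sensor, direction) in enumerate(events):
        if direction != "assert":
            continue
        match = re.fullmatch(r"PSU Fault (\d+)", sensor)
        if match is None:
            continue  # not a PSU Fault sensor
        following = events[index + 1] if index + 1 < len(events) else None
        if following == (sensor, "deassert"):
            # Assert immediately followed by deassert: a power cycle occurred.
            actions.append(f"{sensor}: power cycle, no service action is required")
        else:
            # No deassert follows: replace the matching power supply.
            actions.append(f"{sensor}: replace PSU {match.group(1)}")
    return actions
```

For example, a log that contains an assert and deassert pair for PSU Fault 1 followed by a lone assert for PSU Fault 2 yields no action for PSU 1 and a replacement for PSU 2. The sketch treats "immediately following" as the next entry in the log, which is an assumption.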

Identifying a service action by using sensor and event information for the 8348-21C

You can use the sensor and event information from the system event log to determine a service action to perform for the IBM Power System S812LC (8348-21C).

If you have not done so already, complete “Identifying a service action by using system event logs” on page 27. Then, use the following table to determine the service action to perform.


Table 23. Sensor information, event description, and service action for the 8348-21C

Sensor name (Sensor ID) Event description Service action

Sensor name (Sensor ID): Watchdog (0x00)

Event description:
• Timer Expired
• Reserved1
• Reserved2
• Reserved3
• Reserved4
Service action: No service action is required.

Event description:
• Hard Reset
• Power Down
• Power Cycle
• Timer Interrupt
Service action: SEL events with OEM record c0 | 000e000 | 3a150xxxxxxx indicate that a boot failed. Search for boot failure SEL events that have a time stamp in close proximity to the time stamp of this SEL event. If events exist, go to “Resolving a system firmware boot failure” on page 4. If there are no boot failure SEL events and the system booted correctly, no service action is required.

Sensor name (Sensor ID): Host Status (0x04)

Event description: Unknown
Service action: Go to Getting fixes and update the system firmware to the most recent level of firmware that is available. If this SEL event continues to be logged each time you power on the system, go to “Collecting diagnostic data” on page 109. Then, go to “Contacting IBM service and support” on page 110.

Event description:
• S0/G0 “Working”
• S1 “Sleeping with system h/w & processor context maintained”
• S2 “sleeping, processor context lost”
• S3 “sleeping, processor & h/w context lost, memory retained”
• S4 “non-volatile sleep / suspend-to-disk”
• S5 / G2: “soft-off”
• S4 / S5: “soft-off”
• G3 mechanical Off
• Sleeping in an S1/S2/S3 State
• G1: Sleeping
• S5: entered by override
• Legacy ON state
• Legacy OFF state
Service action: No service action is required.


Sensor name (Sensor ID): FW Boot Progress (0x05)

Event description:
• System Firmware Error
• System Firmware Hang

Event description: System Firmware Progress
Service action: No service action is required.

Sensor name (Sensor ID): OCC Active (0x08)

Event description: Device Disabled
Service action: Replace the system processor. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.

Event description:
• State Deasserted
• Device Enabled
Service action: No service action is required.

Sensor name (Sensor ID): Ambient Temp (0x0A)

Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.

Event description: Upper Critical – going high
Service action: Ensure that the room temperature meets the requirements that are specified for the system. Ensure that no obstructions are blocking air flow to the system.


Sensor name (Sensor ID): CPU Temp (0x64)
Event description:
• Lower Non-critical – going low
• Lower Non-critical – going high
• Lower Critical – going low
• Lower Critical – going high
• Lower Non-recoverable – going low
• Lower Non-recoverable – going high
• Upper Non-critical – going low
• Upper Non-critical – going high
• Upper Critical – going low
• Upper Critical – going high
• Upper Non-recoverable – going low
• Upper Non-recoverable – going high
Service action: No service action is required.


Sensor name (Sensor ID): CPU Func (0x4E)

Event description:
• IERR
• Processor Disabled
• Thermal Trip
• FRB1 BIST Failure
• FRB2 Hang In POST Failure
• FRB3 Processor Startup Initialization Failure
• Configuration Error
• SMBIOS Uncorrectable CPU Complex Error
• Terminator Presence Detected
• Processor Automatically Throttled
• Machine Check Exception
• Correctable Machine Check Error
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Transition to Non-recoverable
• Predictive Failure
Service action: Replace the system processor. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.

Event description:
• Processor Presence Detected
• State Asserted
• Device Enabled
• Transition to OK
• Transition to Non-Critical from OK
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.


Sensor name (Sensor ID): All PGood (0x1C)

Event description:
• Interlock Power Down
• Power Off/Power Down
• Power Cycle
• 240VA Power Down
Service action: No service action is required.

Event description:
• AC Lost
• Soft Power Control Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the system power cords are plugged tightly into both the power supply and the rack power distribution unit (PDU) for both system power supplies.
• Ensure that the system was not powered off.

Event description:
• Power Unit Failure Detected
• Predictive Failure
Service action:
• Ensure that ac power is supplied to the rack.
• Ensure that the power supply cords are plugged tightly into the power supplies and the rack PDU.
• Ensure that the system was not powered off.
• Check for service action required SEL events for the power supply sensor. If any exist, follow the service action that is specified in “Identifying a service action by using sensor and event information for the 8348-21C” on page 78.


Sensor name (Sensor ID):
• DIMM Func 0 (0x1E)
• DIMM Func 1 (0x1F)
• DIMM Func 2 (0x20)
• DIMM Func 3 (0x21)
• DIMM Func 4 (0x22)
• DIMM Func 5 (0x23)
• DIMM Func 6 (0x24)
• DIMM Func 7 (0x25)
• DIMM Func 8 (0x26)
• DIMM Func 9 (0x27)
• DIMM Func 10 (0x28)
• DIMM Func 11 (0x29)
• DIMM Func 12 (0x2A)
• DIMM Func 13 (0x2B)
• DIMM Func 14 (0x2C)
• DIMM Func 15 (0x2D)
• DIMM Func 16 (0x2E)
• DIMM Func 17 (0x2F)
• DIMM Func 18 (0x30)
• DIMM Func 19 (0x31)
• DIMM Func 20 (0x32)
• DIMM Func 21 (0x33)
• DIMM Func 22 (0x34)
• DIMM Func 23 (0x35)
• DIMM Func 24 (0x36)
• DIMM Func 25 (0x37)
• DIMM Func 26 (0x38)
• DIMM Func 27 (0x39)
• DIMM Func 28 (0x3A)
• DIMM Func 29 (0x3B)
• DIMM Func 30 (0x3C)
• DIMM Func 31 (0x3D)

Event description:
• Memory Device Disabled
• Uncorrectable Memory Error
• Memory Scrub Failed
• State Deasserted
• Device Disabled
• Transition to Critical from Less Severe
• Transition to Non-recoverable from Less Severe
• Transition to Critical from Non-recoverable
• Transition to Non-recoverable
• Predictive Failure
Service action: If the sensor name is DIMM Func 0, replace DIMM 0. If the sensor name is DIMM Func 1, replace DIMM 1, and so on. Go to “8348-21C locations” on page 133 to identify the physical location and removal and replacement procedure.

Event description:
• Correctable Memory Error
• Parity
• Correctable Memory Error Logging Limit Reached
• Transition to Non-Critical from More Severe
• Monitor
• Informational
Service action: No service action is required.
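
The DIMM Func sensor IDs in Table 23 run consecutively from 0x1E (DIMM Func 0) to 0x3D (DIMM Func 31), so the DIMM number can be derived from the sensor ID by subtraction. The following Python sketch shows that arithmetic; the constant and function names are hypothetical and are introduced only for illustration.

```python
DIMM_FUNC_BASE_ID = 0x1E  # sensor ID of DIMM Func 0 in Table 23
DIMM_COUNT = 32           # DIMM Func 0 through DIMM Func 31


def dimm_number(sensor_id):
    """Return the DIMM number for a DIMM Func sensor ID (0x1E to 0x3D)."""
    if not DIMM_FUNC_BASE_ID <= sensor_id < DIMM_FUNC_BASE_ID + DIMM_COUNT:
        raise ValueError(f"0x{sensor_id:02X} is not a DIMM Func sensor ID")
    return sensor_id - DIMM_FUNC_BASE_ID


def dimm_sensor_name(sensor_id):
    """Return the sensor name as it is written in Table 23."""
    return f"DIMM Func {dimm_number(sensor_id)}"
```

For example, sensor ID 0x2A maps to DIMM Func 12, which matches the table. Whether a given event on that sensor calls for replacing the DIMM is still determined by the event description, not by this mapping.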
