Kontron S4600 SEL Troubleshooting

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families

Intel order number G90620-002

Revision 1.1

September 2013

Enterprise Platforms and Services Division – Marketing

Revision History

Date            Revision Number  Modifications
January 2013    1.0              Initial release
September 2013  1.1              • Added MIC Thermal Margin sensors C4 through C7.
                                 • Added MIC Status sensors A2, A3, A6, and A7.
                                 • Added voltage sensors EA, EB, EC, ED, and EF.
                                 • Corrected typographical errors.
                                 • Made corrections to Firmware Update Status table.
                                 • Made corrections to Catastrophic Error Sensor table.
                                 • Added support for S1400FP, S1400SP, S1600JP, and S4600LH.

Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.

Table of Contents

1. Introduction
1.1 Purpose
1.2 Industry Standard
1.2.1 Intelligent Platform Management Interface (IPMI)
1.2.2 Baseboard Management Controller (BMC)
1.2.3 Intel® Intelligent Power Node Manager Version 2.0

2. Basic Decoding of a SEL Record
2.1 Default Values in the SEL Records
2.2 Notes on SEL Logs and Collecting SEL Information
2.2.1 Examples of Decoding BIOS Timestamp Events
2.2.2 Example of Decoding a PCI Express* Correctable Error Event
2.2.3 Example of Decoding a Power Supply Predictive Failure Event

3. Sensor Cross Reference List
3.1 BMC owned Sensors (GID = 0020h)
3.2 BIOS POST owned Sensors (GID = 0001h)
3.3 BIOS SMI Handler owned Sensors (GID = 0033h)
3.4 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)
3.5 Microsoft* OS owned Events (GID = 0041)
3.6 Linux* Kernel Panic Events (GID = 0021)

4. Power Subsystems
4.1 Threshold-based Voltage Sensors
4.2 Voltage Regulator Watchdog Timer Sensor
4.2.1 Voltage Regulator Watchdog Timer Sensor – Next Steps
4.3 Power Unit
4.3.1 Power Unit Status Sensor
4.3.2 Power Unit Redundancy Sensor
4.3.3 Node Auto Shutdown Sensor
4.4 Power Supply
4.4.1 Power Supply Status Sensors
4.4.2 Power Supply Power In Sensors
4.4.3 Power Supply Current Out % Sensors
4.4.4 Power Supply Temperature Sensors
4.4.5 Power Supply Fan Tachometer Sensors

5. Cooling Subsystem
5.1 Fan Sensors
5.1.1 Fan Tachometer Sensors
5.1.2 Fan Presence and Redundancy Sensors
5.2 Temperature Sensors
5.2.1 Threshold-based Temperature Sensors
5.2.2 Thermal Margin Sensors
5.2.3 Processor Thermal Control Sensors
5.2.4 Processor DTS Thermal Margin Sensors
5.2.5 Discrete Thermal Sensors
5.2.6 DIMM Thermal Trip Sensors
5.3 System Air Flow Monitoring Sensor

6. Processor Subsystem
6.1 Processor Status Sensor
6.2 Catastrophic Error Sensor
6.3 CPU Missing Sensor
6.3.1 CPU Missing Sensor – Next Steps
6.4 Quick Path Interconnect Sensors
6.4.1 QPI Link Width Reduced Sensor
6.4.2 QPI Correctable Error Sensor
6.4.3 QPI Fatal Error and Fatal Error #2
6.5 Processor ERR2 Timeout Sensor
6.5.1 Processor ERR2 Timeout – Next Steps
6.6 Processor MSID Mismatch Sensor
6.6.1 Processor MSID Mismatch Sensor – Next Steps

7. Memory Subsystem
7.1 Memory RAS Configuration Status
7.2 Memory RAS Mode Select
7.3 Mirroring Redundancy State
7.3.1 Mirroring Redundancy State Sensor – Next Steps
7.4 Sparing Redundancy State
7.4.1 Sparing Redundancy State Sensor – Next Steps
7.5 ECC and Address Parity
7.5.1 Memory Correctable and Uncorrectable ECC Error
7.5.2 Memory Address Parity Error

8. PCI Express* and Legacy PCI Subsystem
8.1 PCI Express* Errors
8.1.1 Legacy PCI Errors
8.1.2 PCI Express* Fatal Errors and Fatal Error #2
8.1.3 PCI Express* Correctable Errors

9. System BIOS Events
9.1 System Events
9.1.1 System Boot
9.1.2 Timestamp Clock Synchronization
9.2 System Firmware Progress (Formerly Post Error)
9.2.1 System Firmware Progress (Formerly Post Error) – Next Steps

10. Chassis Subsystem
10.1 Physical Security
10.1.1 Chassis Intrusion
10.1.2 LAN Leash Lost
10.2 FP (NMI) Interrupt
10.2.1 FP (NMI) Interrupt – Next Steps
10.3 Button Sensor

11. Miscellaneous Events
11.1 IPMI Watchdog
11.2 SMI Timeout
11.2.1 SMI Timeout – Next Steps
11.3 System Event Log Cleared
11.4 System Event – PEF Action
11.4.1 System Event – PEF Action – Next Steps
11.5 BMC Watchdog Sensor
11.5.1 BMC Watchdog Sensor – Next Steps
11.6 BMC FW Health Sensor
11.6.1 BMC FW Health Sensor – Next Steps
11.7 Firmware Update Status Sensor
11.8 Add-In Module Presence Sensor
11.8.1 Add-In Module Presence – Next Steps
11.9 Intel® Xeon Phi™ Coprocessor Management Sensors
11.9.1 Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors
11.9.2 Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors

12. Hot-Swap Controller Backplane Events
12.1 HSC Backplane Temperature Sensor
12.2 Hard Disk Drive Monitoring Sensor
12.3 Hot-Swap Controller Health Sensor
12.3.1 HSC Health Sensor – Next Steps

13. Manageability Engine (ME) Events
13.1 ME Firmware Health Event
13.1.1 ME Firmware Health Event – Next Steps
13.2 Node Manager Exception Event
13.2.1 Node Manager Exception Event – Next Steps
13.3 Node Manager Health Event
13.3.1 Node Manager Health Event – Next Steps
13.4 Node Manager Operational Capabilities Change
13.4.1 Node Manager Operational Capabilities Change – Next Steps
13.5 Node Manager Alert Threshold Exceeded
13.5.1 Node Manager Alert Threshold Exceeded – Next Steps

14. Microsoft Windows* Records
14.1 Boot up Event Records
14.2 Shutdown Event Records
14.3 Bug Check / Blue Screen Event Records

15. Linux* Kernel Panic Records

List of Tables

Table 1: SEL Record Format
Table 2: Event Request Message Event Data Field Contents
Table 3: OEM SEL Record (Type C0h-DFh)
Table 4: OEM SEL Record (Type E0h-FFh)
Table 5: BMC owned Sensors
Table 6: BIOS POST owned Sensors
Table 7: BIOS SMI Handler owned Sensors
Table 8: Management Engine Firmware owned Sensors
Table 9: Microsoft* OS owned Events
Table 10: Linux* Kernel Panic Events
Table 11: Threshold-based Voltage Sensors Typical Characteristics
Table 12: Threshold-based Voltage Sensors Event Triggers – Description
Table 13: Threshold-based Voltage Sensors – Next Steps
Table 14: Voltage Regulator Watchdog Timer Sensor Typical Characteristics
Table 15: Power Unit Status Sensors Typical Characteristics
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Table 17: Power Unit Redundancy Sensors Typical Characteristics
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
Table 19: Node Auto Shutdown Sensor Typical Characteristics
Table 20: Power Supply Status Sensors Typical Characteristics
Table 21: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps
Table 22: Power Supply Power In Sensors Typical Characteristics
Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps
Table 24: Power Supply Current Out % Sensors Typical Characteristics
Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps
Table 26: Power Supply Temperature Sensors Typical Characteristics
Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Table 28: Power Supply Fan Tachometer Sensors Typical Characteristics
Table 29: Fan Tachometer Sensors Typical Characteristics
Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps
Table 31: Fan Presence Sensors Typical Characteristics
Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps
Table 33: Fan Redundancy Sensors Typical Characteristics
Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
Table 35: Temperature Sensors Typical Characteristics
Table 36: Temperature Sensors Event Triggers – Description
Table 37: Temperature Sensors – Next Steps
Table 38: Thermal Margin Sensors Typical Characteristics
Table 39: Thermal Margin Sensors Event Triggers – Description
Table 40: Thermal Margin Sensors – Next Steps
Table 41: Processor Thermal Control Sensors Typical Characteristics
Table 42: Processor Thermal Control Sensors Event Triggers – Description
Table 43: Processor DTS Thermal Margin Sensors Typical Characteristics
Table 44: Discrete Thermal Sensors Typical Characteristics
Table 45: Discrete Thermal Sensors – Next Steps
Table 46: DIMM Thermal Trip Typical Characteristics
Table 47: Processor Status Sensors Typical Characteristics
Table 48: Processor Status Sensors – Next Steps
Table 49: Catastrophic Error Sensor Typical Characteristics
Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps
Table 51: CPU Missing Sensor Typical Characteristics
Table 52: QPI Link Width Reduced Sensor Typical Characteristics
Table 53: QPI Correctable Error Sensor Typical Characteristics
Table 54: QPI Fatal Error Sensor Typical Characteristics
Table 55: QPI Fatal #2 Error Sensor Typical Characteristics
Table 56: Processor ERR2 Timeout Sensor Typical Characteristics
Table 57: Processor MSID Mismatch Sensor Typical Characteristics
Table 58: Memory RAS Configuration Status Sensor Typical Characteristics
Table 59: Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps
Table 60: Memory RAS Mode Select Sensor Typical Characteristics
Table 61: Mirroring Redundancy State Sensor Typical Characteristics
Table 62: Sparing Redundancy State Sensor Typical Characteristics
Table 63: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Table 65: Address Parity Error Sensor Typical Characteristics
Table 66: Legacy PCI Error Sensor Typical Characteristics
Table 67: PCI Express* Fatal Error Sensor Typical Characteristics
Table 68: PCI Express* Fatal Error #2 Sensor Typical Characteristics
Table 69: PCI Express* Correctable Error Sensor Typical Characteristics
Table 70: System Event Sensor Typical Characteristics
Table 71: POST Error Sensor Typical Characteristics
Table 72: POST Error Codes
Table 73: Physical Security Sensor Typical Characteristics
Table 74: Physical Security Sensor Event Trigger Offset – Next Steps
Table 75: FP (NMI) Interrupt Sensor Typical Characteristics
Table 76: Button Sensor Typical Characteristics
Table 77: IPMI Watchdog Sensor Typical Characteristics
Table 78: IPMI Watchdog Sensor Event Trigger Offset – Next Steps
Table 79: SMI Timeout Sensor Typical Characteristics
Table 80: System Event Log Cleared Sensor Typical Characteristics
Table 81: System Event – PEF Action Sensor Typical Characteristics
Table 82: BMC Watchdog Sensor Typical Characteristics
Table 83: BMC FW Health Sensor Typical Characteristics
Table 84: Firmware Update Status Sensor Typical Characteristics
Table 85: Add-In Module Presence Sensor Typical Characteristics
Table 86: MIC Status Sensors – Typical Characteristics
Table 87: HSC Backplane Temperature Sensor Typical Characteristics
Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
Table 89: Hard Disk Drive Monitoring Sensor Typical Characteristics
Table 90: Hard Disk Drive Monitoring Sensor – Event Trigger Offset – Next Steps
Table 91: HSC Health Sensor Typical Characteristics
Table 92: ME Firmware Health Event Sensor Typical Characteristics
Table 93: ME Firmware Health Event Sensor – Next Steps
Table 94: Node Manager Exception Sensor Typical Characteristics
Table 95: Node Manager Health Event Sensor Typical Characteristics
Table 96: Node Manager Operational Capabilities Change Sensor Typical Characteristics
Table 97: Node Manager Alert Threshold Exceeded Sensor Typical Characteristics
Table 98: Boot up Event Record Typical Characteristics
Table 99: Boot up OEM Event Record Typical Characteristics
Table 100: Shutdown Reason Code Event Record Typical Characteristics
Table 101: Shutdown Reason OEM Event Record Typical Characteristics
Table 102: Shutdown Comment OEM Event Record Typical Characteristics
Table 103: Bug Check/Blue Screen – OS Stop Event Record Typical Characteristics
Table 104: Bug Check/Blue Screen code OEM Event Record Typical Characteristics
Table 105: Linux* Kernel Panic Event Record Characteristics
Table 106: Linux* Kernel Panic String Extended Record Characteristics

1. Introduction

The server management hardware built into the Intel® Server Boards and Intel® Server Platforms is a vital part of the overall server management strategy. It provides essential information to the system administrator and lets the administrator control the server remotely, even when the operating system is not running.

The Intel® Server Boards and Intel® Server Platforms offer comprehensive hardware- and software-based solutions. The server management features make the servers simple to manage and provide alerting on system events. From entry-level to enterprise systems, good server management is essential to reducing total cost of ownership.

This Troubleshooting Guide is intended to help the users better understand the events that are logged in the Baseboard Management Controllers (BMC) System Event Logs (SEL) on these Intel® Server Boards.

There is a separate User’s Guide that covers the general server management and the server management software offered on the Intel® Server Boards and Intel® Server Platforms.

Server boards currently supported by this document:

• Intel® S1400FP Server Boards
• Intel® S1400SP Server Boards
• Intel® S1600JP Server Boards
• Intel® S2400BB Server Boards
• Intel® S2400EP Server Boards
• Intel® S2400GP Server Boards
• Intel® S2400LP Server Boards
• Intel® S2400SC Server Boards
• Intel® S2600CO Server Boards
• Intel® S2600CP Server Boards
• Intel® S2600GZ/S2600GL Server Boards
• Intel® S2600IP Server Boards
• Intel® S2600JF Server Boards
• Intel® S2600WP Server Boards
• Intel® S4600LH Server Boards
• Intel® W2600CR Workstation Boards

1.1 Purpose

The purpose of this document is to list all possible events generated by the Intel platform. It may be possible that other sources (not under our control) also generate events, which will not be described in this document.


1.2 Industry Standard

1.2.1 Intelligent Platform Management Interface (IPMI)

The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the inventory, monitoring, logging, and recovery control functions are available independently of the main processors, BIOS, and operating system. Platform management functions can also be made available when the system is in a power-down state.

IPMI works by interfacing with the BMC, which extends management capabilities in the server system and operates independently of the main processor by monitoring the on-board instrumentation. Through the BMC, IPMI also allows administrators to control power to the server, and remotely access BIOS configuration and operating system console information.

IPMI defines a common platform instrumentation interface to enable interoperability between:

• The baseboard management controller and chassis
• The baseboard management controller and systems management software
• Servers

IPMI enables the following:

• Common access to platform management information, consisting of:
  - Local access from systems management software
  - Remote access from LAN
  - Inter-chassis access from Intelligent Chassis Management Bus
  - Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the processor is down
• IPMI interface isolates systems management software from hardware.
• Hardware advancements can be made without impacting the systems management software.
• IPMI facilitates cross-platform management software.

You can find more information on IPMI at the following URL:

http://www.intel.com/design/servers/ipmi

1.2.2 Baseboard Management Controller (BMC)

A baseboard management controller (BMC) is a specialized microcontroller embedded on most Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the intelligence behind intelligent platform management, that is, the autonomous monitoring and recovery features implemented directly in platform management hardware and firmware.

Different types of sensors built into the computer system report to the BMC on parameters such as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC monitors the system for critical events by communicating with various sensors on the system board; it sends alerts and logs events when certain parameters exceed their preset thresholds, indicating a potential failure of the system. The administrator can also remotely communicate with the BMC to take some corrective action such as resetting or power cycling the system to get a hung OS running again. These abilities save on the total cost of ownership of a system.

For Intel® Server Boards and Intel® Server Platforms, the BMC supports the industry standard IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.

1.2.2.1 System Event Log (SEL)

The BMC provides a centralized, non-volatile repository for critical, warning, and informational system events called the System Event Log or SEL. Because the BMC manages the SEL and the logging functions, “post-mortem” logging information remains available if a failure occurs that disables the system processor(s).

The BMC allows access to the SEL through both in-band and out-of-band mechanisms. Tools that can read the SEL include the Intel® SELView utility and several open source IPMI tools.

1.2.3 Intel® Intelligent Power Node Manager Version 2.0

Intel® Intelligent Power Node Manager Version 2.0 (NM) is a platform-resident technology that enforces power and thermal policies for the platform. These policies are applied by exploiting subsystem knobs (such as processor P and T states) that can be used to control power consumption. Intel® Intelligent Power Node Manager enables data center power and thermal management by exposing an external interface to management software through which platform policies can be specified. It also enables specific data center power management usage models such as power limiting.

The configuration and control commands are used by the external management software or BMC to configure and control the Intel® Intelligent Power Node Manager feature. Because Platform Services firmware does not have any external interface, external commands are first received by the BMC over LAN and then relayed to the Platform Services firmware over the IPMB channel. The BMC acts as a relay and transport conversion device for these commands. For simplicity, the commands from the management console might be encapsulated in a generic CONFIG packet format (configuration data length, configuration data blob) to the BMC so that the BMC does not have to parse the actual configuration data.

The BMC provides the access point for remote commands from external management software and generates alerts to that software. Intel® Intelligent Power Node Manager on Intel® Manageability Engine (Intel® ME) is an IPMI satellite controller. A mechanism exists to forward commands to Intel® ME and then send the response back to the originator. Similarly, events from Intel® ME are sent as alerts outside of the BMC.

2. Basic Decoding of a SEL Record

The System Event Log (SEL) record format is defined in the IPMI Specification. This section provides a basic definition of each field in a SEL record. For more details, see the IPMI Specification.

The definitions for the standard SEL record can be found in Table 1. The definitions for the OEM-defined event records can be found in Table 3 and Table 4.

2.1 Default Values in the SEL Records

Unless otherwise noted in the event record descriptions, the following are the default values in all SEL entries:

• Byte [3] = Record Type (RT) = 02h = System event record
• Byte [9:8] = Generator ID = 0020h = BMC Firmware
• Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0

Table 1: SEL Record Format

Byte   Field                   Description

1-2    Record ID (RID)         ID used for SEL Record access.

3      Record Type (RT)        [7:0] – Record Type
                                 02h = System event record
                                 C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
                                 E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)

4-7    Timestamp (TS)          Time when event was logged. LS byte first.
                               Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
                               Note: There are various websites that will convert the raw number to a date/time.

8-9    Generator ID (GID)      RqSA and LUN if event was generated from IPMB. Software ID if event was generated from system software.
                               Byte 1
                                 [7:1] – 7-bit I2C Slave Address, or 7-bit system software ID
                                 [0] – 0b = ID is IPMB Slave Address
                                       1b = System software ID
                               Software ID values:
                                 0001h – BIOS POST for POST errors, RAS Configuration/State, Timestamp Synch, OS Boot events
                                 0033h – BIOS SMI Handler
                                 0020h – BMC Firmware
                                 002Ch – ME Firmware
                                 0041h – Server Management Software
                                 00C0h – HSC Firmware – HSBP A
                                 00C2h – HSC Firmware – HSBP B
                               Byte 2
                                 [7:4] – Channel number. Channel that event message was received over. 0h if the event message was received from the system interface, primary IPMB, or internally generated by the BMC.
                                 [3:2] – Reserved. Write as 00b.
                                 [1:0] – IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.

10     EvM Rev (ER)            Event Message format version. 04h = IPMI v2.0; 03h = IPMI v1.0

11     Sensor Type (ST)        Sensor Type Code for sensor that generated the event

12     Sensor # (SN)           Number of sensor that generated the event (From SDR)

13     Event Dir | Event Type  Event Dir
       (EDIR)                    [7] – 0b = Assertion event.
                                       1b = Deassertion event.
                               Event Type
                                 Type of trigger for the event, for example, critical threshold going high, state asserted, and so on. Also indicates class of the event, for example, discrete, threshold, or OEM. The Event Type field is encoded using the Event/Reading Type Code.
                                 [6:0] – Event Type Codes
                                   01h = Threshold (States = 00h-0Bh)
                                   02h-0Ch = Discrete
                                   6Fh = Sensor-Specific
                                   70h-7Fh = OEM

14     Event Data 1 (ED1)      Per Table 2

15     Event Data 2 (ED2)      Per Table 2

16     Event Data 3 (ED3)      Per Table 2
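As an illustration of the layout in Table 1, the following Python sketch unpacks a raw 16-byte system event record into its named fields. The function name decode_sel_record and the dictionary keys are assumptions made for this example and do not come from any Intel utility; only record type 02h is decoded.

```python
import struct
from datetime import datetime, timezone


def decode_sel_record(raw: bytes) -> dict:
    """Split a 16-byte SEL record into the fields listed in Table 1."""
    if len(raw) != 16:
        raise ValueError("a SEL record is exactly 16 bytes")
    # Bytes 1-2: Record ID (LS byte first); byte 3: Record Type.
    record_id, record_type = struct.unpack_from("<HB", raw, 0)
    fields = {"record_id": record_id, "record_type": record_type}
    if record_type != 0x02:
        # OEM records (C0h-FFh) use the layouts in Table 3 and Table 4.
        return fields
    # Bytes 4-7: timestamp; bytes 8-9: Generator ID; bytes 10-16: one byte each.
    ts, gid, er, st, sn, edir, ed1, ed2, ed3 = struct.unpack_from("<IH7B", raw, 3)
    fields.update(
        timestamp=datetime.fromtimestamp(ts, tz=timezone.utc),
        generator_id=gid,
        evm_rev=er,
        sensor_type=st,
        sensor_number=sn,
        deassertion=bool(edir & 0x80),  # Event Dir, bit [7]
        event_type=edir & 0x7F,         # Event Type, bits [6:0]
        event_data=(ed1, ed2, ed3),
    )
    return fields
```

For example, decode_sel_record(bytes.fromhex("5D0002D3B1AE4E20000408506FA20630")) decodes the power supply record that is walked through in section 2.2.3.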

Table 2: Event Request Message Event Data Field Contents

Sensor Class   Event Data

Threshold      Event Data 1
                 [7:6] – 00b = Unspecified Event Data 2
                         01b = Trigger reading in Event Data 2
                         10b = OEM code in Event Data 2
                         11b = Sensor-specific event extension code in Event Data 2
                 [5:4] – 00b = Unspecified Event Data 3
                         01b = Trigger threshold value in Event Data 3
                         10b = OEM code in Event Data 3
                         11b = Sensor-specific event extension code in Event Data 3
                 [3:0] – Offset from Event/Reading Code for threshold event.
               Event Data 2 – Reading that triggered event, FFh or not present if unspecified.
               Event Data 3 – Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present.

Discrete       Event Data 1
                 [7:6] – 00b = Unspecified Event Data 2
                         01b = Previous state and/or severity in Event Data 2
                         10b = OEM code in Event Data 2
                         11b = Sensor-specific event extension code in Event Data 2
                 [5:4] – 00b = Unspecified Event Data 3
                         01b = Reserved
                         10b = OEM code in Event Data 3
                         11b = Sensor-specific event extension code in Event Data 3
                 [3:0] – Offset from Event/Reading Code for discrete event state
               Event Data 2
                 [7:4] – Optional offset from “Severity” Event/Reading Code (0Fh if unspecified).
                 [3:0] – Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).
               Event Data 3 – Optional OEM code. FFh or not present if unspecified.

OEM            Event Data 1
                 [7:6] – 00b = Unspecified Event Data 2
                         01b = Previous state and/or severity in Event Data 2
                         10b = OEM code in Event Data 2
                         11b = Reserved
                 [5:4] – 00b = Unspecified Event Data 3
                         01b = Reserved
                         10b = OEM code in Event Data 3
                         11b = Reserved
                 [3:0] – Offset from Event/Reading Type Code
               Event Data 2
                 [7:4] – Optional OEM code bits or offset from “Severity” Event/Reading Type Code (0Fh if unspecified).
                 [3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).
               Event Data 3 – Optional OEM code. FFh or not present if unspecified.
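The bit fields of Event Data 1 in Table 2 can be split mechanically. The following Python sketch is one possible way to do it; the lookup tables simply restate Table 2, and the names ED2_MEANING, ED3_MEANING, and decode_event_data1 are hypothetical names chosen for this example.

```python
# Meaning of Event Data 1 bits [7:6] (contents of Event Data 2), per Table 2.
ED2_MEANING = {
    "threshold": ("unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"),
    "oem": ("unspecified", "previous state and/or severity", "OEM code",
            "reserved"),
}

# Meaning of Event Data 1 bits [5:4] (contents of Event Data 3), per Table 2.
ED3_MEANING = {
    "threshold": ("unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"),
    "oem": ("unspecified", "reserved", "OEM code", "reserved"),
}


def decode_event_data1(ed1: int, sensor_class: str) -> dict:
    """Describe what Event Data 2 and 3 hold and return the event offset."""
    return {
        "ed2_contains": ED2_MEANING[sensor_class][(ed1 >> 6) & 0x03],
        "ed3_contains": ED3_MEANING[sensor_class][(ed1 >> 4) & 0x03],
        "offset": ed1 & 0x0F,
    }
```

With ED1 = A2h on a discrete sensor, this reports an OEM code in both Event Data 2 and Event Data 3 and an event offset of 2h.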

Table 3: OEM SEL Record (Type C0h-DFh)

Byte   Field              Description

1-2    Record ID (RID)    ID used for SEL Record access.

3      Record Type (RT)   [7:0] – Record Type
                            C0h-DFh = OEM timestamped, bytes 8-16 OEM defined

4-7    Timestamp (TS)     Time when event was logged. LS byte first.
                          Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
                          Note: There are various websites that will convert the raw number to a date/time.

8-10   Manufacturer ID    LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA “Private Enterprise” ID.
                          Most significant four bits = Reserved (0000b).
                          000000h = Unspecified. 0FFFFFh = Reserved.
                          This value is binary encoded. For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.

11-16  OEM Defined        OEM Defined. This is defined according to the manufacturer identified by the Manufacturer ID field.
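The Manufacturer ID byte order in Table 3 can be checked with a few lines of Python. This is a minimal sketch; decode_manufacturer_id is a hypothetical helper name, and it assumes the caller passes a complete 16-byte timestamped OEM record.

```python
def decode_manufacturer_id(record: bytes) -> int:
    """Return the 20-bit manufacturer ID stored in bytes 8-10 of an OEM record."""
    if len(record) != 16:
        raise ValueError("a SEL record is exactly 16 bytes")
    if not 0xC0 <= record[2] <= 0xDF:
        raise ValueError("not a timestamped OEM record (type C0h-DFh)")
    # Bytes 8-10 are stored LS byte first; the top four bits are reserved.
    return int.from_bytes(record[7:10], "little") & 0x0FFFFF
```

A record whose bytes 8 through 10 are F2h, 1Bh, 00h yields 7154, the IPMI forum ID used as the example in Table 3.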

Table 4: OEM SEL Record (Type E0h-FFh)

Byte   Field              Description

1-2    Record ID (RID)    ID used for SEL Record access.

3      Record Type (RT)   [7:0] – Record Type
                            E0h-FFh = OEM system event record

4-16   OEM                OEM Defined. This is defined by the system integrator.

2.2 Notes on SEL Logs and Collecting SEL Information

Whenever you capture the SEL, always collect both the text (human-readable) version and the hex version. Because some of the data is OEM-specific, some utilities cannot decode the information correctly. In addition, some OEM-specific data contains additional variables that are not decoded at all.

One example of information that is not fully decoded is the BIOS timestamp synchronization event. This event can be logged by the BIOS during POST, or it can be logged by the BIOS SMI Handler when the operating system (OS) requests a shutdown or restart. See section 2.2.1 for examples. Most utilities report this as just a BIOS event and do not differentiate between the two sources. The distinction is sometimes useful because it makes the sequence of events clearer. For example, if there are multiple sequences of timestamp synchronization events, the source helps answer whether power was lost after booting to the OS and the system then restarted, whether there were multiple POST events, or whether the OS requested a restart.

Other examples are the PCI Express* errors and some of the Power Supply events. For the PCI Express* errors, the type of error and the PCI Bus, Device, and Function are all a part of Event Data 1 through Event Data 3. See section 2.2.2. For the Power Supply events, when there is a failure, predictive failure, or configuration error, Event Data 2 and Event Data 3 hold additional information that describes the power supply's PMBus* command registers and values for that particular event. See section 2.2.3.

2.2.1 Examples of Decoding BIOS Timestamp Events

The following are some samples of BIOS timestamp events during POST and during an OS shutdown.

2.2.1.1 BIOS POST Timestamp Events

RID[19][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]

RID[1A][01] RT[02] TS[57][49][6A][4E] GID[01][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[80] ED3[FF]

RID (Record ID) = 0119h
RT (Record Type) = 02h = System event record
TS (Timestamp) = 4E6A4957h
GID (Generator ID) = 0001h = BIOS POST
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh
  [7] = 0 = Assertion Event
  [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair

RID (Record ID) = 011Ah
RT (Record Type) = 02h = System event record
TS (Timestamp) = 4E6A4957h
GID (Generator ID) = 0001h = BIOS POST
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh
  [7] = 0 = Assertion Event
  [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 80h = Second in pair

2.2.1.2 BIOS SMI Handler Timestamp Events

RID[1F][00] RT[02] TS[C3][70][8D][4F] GID[33][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[00] ED3[FF]

RID[20][00] RT[02] TS[C4][70][8D][4F] GID[33][00] ER[04] ST[12] SN[83] EDIR[6F] ED1[05] ED2[80] ED3[FF]

RID (Record ID) = 001Fh
RT (Record Type) = 02h = System event record
TS (Timestamp) = 4F8D70C3h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh
  [7] = 0 = Assertion Event
  [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair

RID (Record ID) = 0020h
RT (Record Type) = 02h = System event record
TS (Timestamp) = 4F8D70C4h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh
  [7] = 0 = Assertion Event
  [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 80h = Second in pair
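Because most utilities do not separate the two sources of the timestamp synchronization event, a small script can do so from the hex dump. The following Python sketch parses the bracketed notation used in the examples above; the names FIELD, parse_bracketed, and describe_timestamp_sync are hypothetical, and the sketch assumes the dump uses exactly this TAG[xx] notation.

```python
import re

# Generator IDs that appear in the section 2.2.1 examples.
GENERATORS = {0x0001: "BIOS POST", 0x0033: "BIOS SMI Handler"}
# Event Data 2 values that appear in the section 2.2.1 examples.
PAIR_POSITION = {0x00: "first in pair", 0x80: "second in pair"}

# A field tag (RID, RT, TS, ED1, ...) followed by one or more [xx] hex bytes.
FIELD = re.compile(r"([A-Z]+[0-9]?)((?:\[[0-9A-Fa-f]{2}\])+)")


def parse_bracketed(line: str) -> dict:
    """Map each field tag in a dumped record to its raw bytes."""
    return {
        name: bytes.fromhex(body.replace("[", "").replace("]", ""))
        for name, body in FIELD.findall(line)
    }


def describe_timestamp_sync(line: str):
    """Return (source, position) for a timestamp sync event, else None."""
    f = parse_bracketed(line)
    # Sensor Type 12h with ED1 = 05h is Timestamp Clock Synchronization.
    if f["ST"][0] != 0x12 or (f["ED1"][0] & 0x0F) != 0x05:
        return None
    gid = int.from_bytes(f["GID"], "little")  # LS byte first
    source = GENERATORS.get(gid, "generator %04Xh" % gid)
    position = PAIR_POSITION.get(f["ED2"][0], "unknown")
    return source, position
```

Running describe_timestamp_sync over each line of a hex dump shows whether a given synchronization pair came from POST or from an OS-requested shutdown or restart.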

2.2.2 Example of Decoding a PCI Express* Correctable Error Event

The following is an example of decoding a PCI Express* correctable error event. For this particular event it recorded a receiver error on Bus 0, Device 2, and Function 2. Note that correctable errors are acceptable and normal at a low rate of occurrence.

RID[27][00] RT[02] TS[0A][9B][2E][50] GID[33][00] ER[04] ST[13] SN[05] EDIR[71] ED1[A0] ED2[00] ED3[12]

RID (Record ID) = 0027h
RT (Record Type) = 02h = System event record
TS (Timestamp) = 502E9B0Ah
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 13h = Critical Interrupt (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 05h
EDIR (Event Direction/Event Type) = 71h
  [7] = 0 = Assertion Event
  [6:0] = 71h = OEM Specific for PCI Express* correctable errors
ED1 (Event Data 1) = A0h
  [7:6] = 10b = OEM code in Event Data 2
  [5:4] = 10b = OEM code in Event Data 3
  [3:0] = Event Trigger Offset = 0h = Receiver Error
ED2 (Event Data 2) = 00h = PCI Bus number = 0
ED3 (Event Data 3) = 12h
  [7:3] = PCI Device number = 02h
  [2:0] = PCI Function number = 2
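The bus, device, and function split shown in this example is a simple bit operation. The following Python sketch illustrates it; decode_pcie_location is a hypothetical helper name, and the field layout is taken only from the decode above.

```python
def decode_pcie_location(ed2: int, ed3: int) -> tuple:
    """Return (bus, device, function) from Event Data 2 and Event Data 3."""
    bus = ed2                      # Event Data 2: PCI Bus number
    device = (ed3 >> 3) & 0x1F     # Event Data 3 [7:3]: PCI Device number
    function = ed3 & 0x07          # Event Data 3 [2:0]: PCI Function number
    return bus, device, function
```

For the record above, decode_pcie_location(0x00, 0x12) gives bus 0, device 2, function 2.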

2.2.3 Example of Decoding a Power Supply Predictive Failure Event

The following is an example of decoding a Power Supply predictive failure event. In this example, power supply 1 saw an A/C power loss event with both the input under-voltage warning and fault bits set. In most cases this means that the A/C input dropped below the minimum warning and fault thresholds for over 20 milliseconds but the system remained powered on. If these events continue to occur, it is advisable to check your power source.

RID[5D][00] RT[02] TS[D3][B1][AE][4E] GID[20][00] ER[04] ST[08] SN[50] EDIR[6F] ED1[A2] ED2[06] ED3[30]

RID (Record ID) = 005Dh
RT (Record Type) = 02h = System event record
TS (Timestamp) = 4EAEB1D3h
GID (Generator ID) = 0020h = BMC
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 08h = Power Supply (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 50h = Power Supply 1
EDIR (Event Direction/Event Type) = 6Fh
  [7] = 0 = Assertion Event
  [6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = A2h
  [7:6] = 10b = OEM code in Event Data 2
  [5:4] = 10b = OEM code in Event Data 3
  [3:0] = Event Trigger Offset = 2h = Predictive Failure
ED2 (Event Data 2) = 06h = Input under-voltage warning
ED3 (Event Data 3) = 30h; From PMBus* Specification STATUS_INPUT command
  [5] = VIN_UV_WARNING (Input Under-voltage Warning) = 1
  [4] = VIN_UV_FAULT (Input Under-voltage Fault) = 1
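The Event Data 3 bits in this example can be tested with a short Python sketch. Only the two STATUS_INPUT bits named in the example are included; the table STATUS_INPUT_BITS and the function decode_status_input are hypothetical names, and the remaining bits should be taken from the PMBus* Specification before relying on this for other events.

```python
# STATUS_INPUT bits named in the example above (bit number -> flag name).
STATUS_INPUT_BITS = {
    5: "VIN_UV_WARNING",  # Input Under-voltage Warning
    4: "VIN_UV_FAULT",    # Input Under-voltage Fault
}


def decode_status_input(ed3: int) -> list:
    """List the known STATUS_INPUT flags that are set in Event Data 3."""
    return [
        name
        for bit, name in sorted(STATUS_INPUT_BITS.items(), reverse=True)
        if ed3 & (1 << bit)
    ]
```

For ED3 = 30h this returns both flags, which matches the A/C under-voltage condition described in the example.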

3. Sensor Cross Reference List

This section contains a cross reference to help find details on any specific SEL entry.

3.1 BMC owned Sensors (GID = 0020h)

The following table can be used to find the details of sensors owned by the BMC.

Table 5: BMC owned Sensors

Sensor Number

Sensor Name

Details Section

Next Steps

01h

Power Unit Status (Pwr Unit Status)

Power Unit Status Sensor

Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps

02h

Power Unit Redundancy (Pwr Unit Redund)

Power Unit Redundancy Sensor

Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps

03h

IPMI Watchdog (IPMI Watchdog)

IPMI Watchdog

Table 78: IPMI Watchdog Sensor Event Trigger Offset – Next Steps

04h

Physical Security (Physical Scrty)

Physical Security

Table 74: Physical Security Sensor Event Trigger Offset – Next Steps

05h

FP Interrupt (FP NMI Diag Int)

FP (NMI) Interrupt

FP (NMI) Interrupt – Next Steps

06h

SMI Timeout (SMI Timeout)

SMI Timeout

SMI Timeout – Next Steps

07h

System Event Log (System Event Log)

System Event Log Cleared

Not applicable

08h

System Event (System Event)

System Event – PEF Action

System Event – PEF Action – Next Steps

09h

Button Sensor (Button)

Button Sensor

Not applicable

0Ah

BMC Watchdog (BMC Watchdog)

BMC Watchdog Sensor

BMC Watchdog Sensor – Next Steps

0Bh

Voltage Regulator Watchdog (VR Watchdog)

Voltage Regulator Watchdog Timer Sensor

Voltage Regulator Watchdog Timer Sensor – Next Steps

0Ch

Fan Redundancy (Fan Redundancy)

Fan Presence and Redundancy Sensors

Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps

0Dh

SSB Thermal Trip (SSB Thermal Trip)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

0Eh

IO Module Presence (IO Mod Presence)

Add-In Module Presence Sensor

Add-In Module Presence – Next Steps

0Fh

SAS Module Presence (SAS Mod Presence)

Add-In Module Presence Sensor

Add-In Module Presence – Next Steps

10h

BMC Firmware Health (BMC FW Health)

BMC FW Health Sensor

BMC FW Health Sensor – Next Steps

11h

System Airflow (System Airflow)

System Air Flow Monitoring Sensor

Not applicable

12h

Firmware Update Status (FW Update Status)

Firmware Update Status Sensor

Not applicable

13h

IO Module2 Presence (IO Mod2 Presence)

Add-In Module Presence Sensor

Add-In Module Presence – Next Steps

14h

Baseboard Temperature 5 (Platform Specific)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

15h

Baseboard Temperature 6 (Platform Specific)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

16h

IO Module2 Temperature (I/O Mod2 Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

17h

PCI Riser 3 Temperature (PCI Riser 3 Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

18h

PCI Riser 4 Temperature (PCI Riser 4 Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

19h

Baseboard +1.05V Processor3 Vccp (BB +1.05Vccp P3)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

1Ah

Baseboard +1.05V Processor4 Vccp (BB +1.05Vccp P4)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

20h

Baseboard Temperature 1 (Platform Specific)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

21h

Front Panel Temperature (Front Panel Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

22h

SSB Temperature (SSB Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

23h

Baseboard Temperature 2 (Platform Specific)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

24h

Baseboard Temperature 3 (Platform Specific)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

25h

Baseboard Temperature 4 (Platform Specific)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

26h

IO Module Temperature (I/O Mod Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

27h

PCI Riser 1 Temperature (PCI Riser 1 Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

28h

IO Riser Temperature (IO Riser Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

29h–2Bh

Hot-Swap Back Plane 1-3 Temperature (HSBP 1-3 Temp)

HSC Backplane Temperature Sensor

Table 88: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps

2Ch

PCI Riser 2 Temperature (PCI Riser 2 Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

2Dh

SAS Module Temperature (SAS Mod Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

2Eh

Exit Air Temperature (Exit Air Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

2Fh

Network Interface Controller Temperature (LAN NIC Temp)

Threshold-based Temperature Sensors

Table 37: Temperature Sensors – Next Steps

30h–3Fh

Fan Tachometer Sensors (Chassis specific sensor names)

Fan Tachometer Sensors

Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps

40h–4Fh

Fan Present Sensors (Fan x Present)

Fan Presence and Redundancy Sensors

Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps

50h

Power Supply 1 Status (PS1 Status)

Power Supply Status Sensors

Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps

51h

Power Supply 2 Status (PS2 Status)

Power Supply Status Sensors

Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps

54h

Power Supply 1 AC Power Input (PS1 Power In)

Power Supply Power In Sensors

Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps

55h

Power Supply 2 AC Power Input (PS2 Power In)

Power Supply Power In Sensors

Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps

58h

Power Supply 1 +12V % of Maximum Current Output (PS1 Curr Out %)

Power Supply Current Out % Sensors

Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps

59h

Power Supply 2 +12V % of Maximum Current Output (PS2 Curr Out %)

Power Supply Current Out % Sensors

Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps

5Ch

Power Supply 1 Temperature (PS1 Temperature)

Power Supply Temperature Sensors

Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps

5Dh

Power Supply 2 Temperature (PS2 Temperature)

Power Supply Temperature Sensors

Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps

60h-68h

Hard Disk Drive 15 – 23 Status (HDD 15 – 23 Status)

Hard Disk Drive Monitoring Sensor

Table 90: Hard Disk Drive Monitoring Sensor - Event Trigger Offset – Next Steps

69h-6Bh

Hot-Swap Controller 1-3 Status (HSC1 – 3 Status)

Hot-Swap Controller Health Sensor

HSC Health Sensor – Next Steps

70h

Processor 1 Status (P1 Status)

Processor Status Sensor

Table 48: Processor Status Sensors – Next Steps

71h

Processor 2 Status (P2 Status)

Processor Status Sensor

Table 48: Processor Status Sensors – Next Steps

72h

Processor 3 Status (P3 Status)

Processor Status Sensor

Table 48: Processor Status Sensors – Next Steps

73h

Processor 4 Status (P4 Status)

Processor Status Sensor

Table 48: Processor Status Sensors – Next Steps

74h

Processor 1 Thermal Margin (P1 Therm Margin)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

75h

Processor 2 Thermal Margin (P2 Therm Margin)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

76h

Processor 3 Thermal Margin (P3 Therm Margin)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

77h

Processor 4 Thermal Margin (P4 Therm Margin)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

78h-7Bh

Processor 1 – 4 Thermal Control % (P1 – P4 Therm Ctrl %)

Processor Thermal Control Sensors

Processor Thermal Control % Sensors – Next Steps

7Ch

Processor 1 ERR2 Timeout (P1 ERR2)

Processor ERR2 Timeout Sensor

Processor ERR2 Timeout – Next Steps

7Dh

Processor 2 ERR2 Timeout (P2 ERR2)

Processor ERR2 Timeout Sensor

Processor ERR2 Timeout – Next Steps

7Eh

Processor 3 ERR2 Timeout (P3 ERR2)

Processor ERR2 Timeout Sensor

Processor ERR2 Timeout – Next Steps

7Fh

Processor 4 ERR2 Timeout (P4 ERR2)

Processor ERR2 Timeout Sensor

Processor ERR2 Timeout – Next Steps

80h

Catastrophic Error (CATERR)

Catastrophic Error Sensor

Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps

81h

Processor 1 MSID Mismatch (P1 MSID Mismatch)

Processor MSID Mismatch Sensor

Processor MSID Mismatch Sensor – Next Steps

82h

Processor Population Fault (CPU Missing)

CPU Missing Sensor

CPU Missing Sensor – Next Steps

83h-86h

Processor 1 – 4 DTS Thermal Margin

(P1 – P4 DTS Therm Mgn)

Processor DTS Thermal Margin Sensors

Not applicable

87h

Processor 2 MSID Mismatch (P2 MSID Mismatch)

Processor MSID Mismatch Sensor

Processor MSID Mismatch Sensor – Next Steps

88h

Processor 3 MSID Mismatch (P3 MSID Mismatch)

Processor MSID Mismatch Sensor

Processor MSID Mismatch Sensor – Next Steps

89h

Processor 4 MSID Mismatch (P4 MSID Mismatch)

Processor MSID Mismatch Sensor

Processor MSID Mismatch Sensor – Next Steps

90h

Processor 1 VRD Temp (P1 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

91h

Processor 2 VRD Temp (P2 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

92h

Processor 3 VRD Temp (P3 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

93h

Processor 4 VRD Temp (P4 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

94h

Processor 1 Memory VRD Hot 0-1 (P1 Mem01 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

95h

Processor 1 Memory VRD Hot 2-3 (P1 Mem23 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

96h

Processor 2 Memory VRD Hot 0-1 (P2 Mem01 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

97h

Processor 2 Memory VRD Hot 2-3 (P2 Mem23 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

98h

Processor 3 Memory VRD Hot 0-1 (P3 Mem01 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

99h

Processor 3 Memory VRD Hot 2-3 (P3 Mem23 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

9Ah

Processor 4 Memory VRD Hot 0-1 (P4 Mem01 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

9Bh

Processor 4 Memory VRD Hot 2-3 (P4 Mem23 VRD Hot)

Discrete Thermal Sensors

Table 45: Discrete Thermal Sensors – Next Steps

A0h

Power Supply 1 Fan Tachometer 1 (PS1 Fan Tach 1)

Power Supply Fan Tachometer Sensors

Power Supply Fan Tachometer Sensors – Next Steps

A1h

Power Supply 1 Fan Tachometer 2 (PS1 Fan Tach 2)

Power Supply Fan Tachometer Sensors

Power Supply Fan Tachometer Sensors – Next Steps

A2h

Intel® Xeon Phi™ Coprocessor Status 1

(MIC 1 Status)

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps

A3h

Intel® Xeon Phi™ Coprocessor Status 2

(MIC 2 Status)

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps

A4h

Power Supply 2 Fan Tachometer 1 (PS2 Fan Tach 1)

Power Supply Fan Tachometer Sensors

Power Supply Fan Tachometer Sensors – Next Steps

A5h

Power Supply 2 Fan Tachometer 2 (PS2 Fan Tach 2)

Power Supply Fan Tachometer Sensors

Power Supply Fan Tachometer Sensors – Next Steps

A6h

Intel® Xeon Phi™ Coprocessor Status 3

(MIC 3 Status)

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps

A7h

Intel® Xeon Phi™ Coprocessor Status 4

(MIC 4 Status)

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors

Intel® Xeon Phi™ Coprocessor (MIC) Status Sensors Next Steps

B0h

Processor 1 DIMM Aggregate Thermal Margin 1

(P1 DIMM Thrm Mrgn1)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B1h

Processor 1 DIMM Aggregate Thermal Margin 2

(P1 DIMM Thrm Mrgn2)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B2h

Processor 2 DIMM Aggregate Thermal Margin 1

(P2 DIMM Thrm Mrgn1)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B3h

Processor 2 DIMM Aggregate Thermal Margin 2

(P2 DIMM Thrm Mrgn2)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B4h

Processor 3 DIMM Aggregate Thermal Margin 1

(P3 DIMM Thrm Mrgn1)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B5h

Processor 3 DIMM Aggregate Thermal Margin 2

(P3 DIMM Thrm Mrgn2)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B6h

Processor 4 DIMM Aggregate Thermal Margin 1

(P4 DIMM Thrm Mrgn1)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B7h

Processor 4 DIMM Aggregate Thermal Margin 2

(P4 DIMM Thrm Mrgn2)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

B8h

Node Auto-Shutdown Sensor (Auto Shutdown)

Node Auto Shutdown Sensor

Node Auto Shutdown Sensor – Next Steps

BAh-BFh

Fan Tachometer Sensors (Chassis specific sensor names)

Fan Tachometer Sensors

Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps

C0h-C3h

Processor 1 – 4 DIMM Thermal Trip

(P1 – P4 Mem Thrm Trip)

DIMM Thermal Trip Sensors

DIMM Thermal Trip Sensors – Next Steps

C4h

Intel® Xeon Phi™ Coprocessor Thermal Margin 1

(MIC 1 Margin)

Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors

Not applicable

C5h

Intel® Xeon Phi™ Coprocessor Thermal Margin 2

(MIC 2 Margin)

Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors

Not applicable

C6h

Intel® Xeon Phi™ Coprocessor Thermal Margin 3

(MIC 3 Margin)

Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors

Not applicable

C7h

Intel® Xeon Phi™ Coprocessor Thermal Margin 4

(MIC 4 Margin)

Intel® Xeon Phi™ Coprocessor (MIC) Thermal Margin Sensors

Not applicable

C8h-CFh

Global Aggregate Temperature Margin 1 – 8

(Agg Therm Mrgn 1 – 8)

Thermal Margin Sensors

Table 40: Thermal Margin Sensors – Next Steps

D0h

Baseboard +12V (BB +12.0V)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D1h

Baseboard +5V (BB +5.0V)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D2h

Baseboard +3.3V (BB +3.3V)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D3h

Baseboard +5V Stand-by (BB +5.0V STBY)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D4h

Baseboard +3.3V Auxiliary (BB +3.3V AUX)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D6h

Baseboard +1.05V Processor1 Vccp

(BB +1.05Vccp P1)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D7h

Baseboard +1.05V Processor2 Vccp

(BB +1.05Vccp P2)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D8h

Baseboard +1.5V P1 Memory AB VDDQ

(BB +1.5 P1MEM AB)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

D9h

Baseboard +1.5V P1 Memory CD VDDQ

(BB +1.5 P1MEM CD)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

DAh

Baseboard +1.5V P2 Memory AB VDDQ

(BB +1.5 P2MEM AB)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

DBh

Baseboard +1.5V P2 Memory CD VDDQ

(BB +1.5 P2MEM CD)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

DCh

Baseboard +1.8V Aux (BB +1.8V AUX)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

DDh

Baseboard +1.1V Stand-by (BB +1.1V STBY)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

DEh

Baseboard CMOS Battery (BB +3.3V Vbat)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

E4h

Baseboard +1.35V P1 Low Voltage Memory AB VDDQ

(BB +1.35 P1LV AB)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

E5h

Baseboard +1.35V P1 Low Voltage Memory CD VDDQ

(BB +1.35 P1LV CD)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

E6h

Baseboard +1.35V P2 Low Voltage Memory AB VDDQ

(BB +1.35 P2LV AB)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

E7h

Baseboard +1.35V P2 Low Voltage Memory CD VDDQ

(BB +1.35 P2LV CD)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

EAh

Baseboard +3.3V Riser 1 Power Good

(BB +3.3 RSR1 PGD)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

EBh

Baseboard +3.3V Riser 2 Power Good

(BB +3.3 RSR2 PGD)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

ECh

Baseboard +0.9V (BB 0.9V Core IB)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

EDh

Baseboard +1.8V (BB 1.8V IB I/O)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

EEh

Baseboard +1.1V (BB 1.1V PCH)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

EFh

Baseboard +1.2V (BB +1.2V IB)

Threshold-based Voltage Sensors

Table 13: Threshold-based Voltage Sensors – Next Steps

F0h-FEh

Hard Disk Drive 0 – 14 Status (HDD 0 – 14 Status)

Hard Disk Drive Monitoring Sensor

Table 90: Hard Disk Drive Monitoring Sensor - Event Trigger Offset – Next Steps
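The cross-reference list splits the Hard Disk Drive status sensors across two sensor number ranges: F0h-FEh for HDD 0 – 14 and 60h-68h for HDD 15 – 23. The following Python sketch folds both ranges into one lookup. It is illustrative only: the function name `hdd_index_from_sensor` is hypothetical, and the sketch assumes that sensor numbers map to drive indices in ascending order within each range, which the list implies but does not state.

```python
from typing import Optional


def hdd_index_from_sensor(sensor_number: int) -> Optional[int]:
    """Return the HDD index for an HDD status sensor number, or None.

    Assumption: within each range the drive index increases by one
    per sensor number.
    """
    if 0xF0 <= sensor_number <= 0xFE:   # HDD 0 - 14 Status
        return sensor_number - 0xF0
    if 0x60 <= sensor_number <= 0x68:   # HDD 15 - 23 Status
        return 15 + (sensor_number - 0x60)
    return None                         # not an HDD status sensor
```

For example, under these assumptions sensor number F3h would resolve to HDD 3 and sensor number 62h to HDD 17.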

3.2 BIOS POST owned Sensors (GID = 0001h)

The following table can be used to find the details of sensors owned by BIOS POST.

Table 6: BIOS POST owned Sensors

Sensor Number

Sensor Name

Details Section

Next Steps

02h

Memory RAS Configuration Status

Table 58: Memory RAS Configuration Status Sensor Typical Characteristics

06h

POST Error

System Firmware Progress (Formerly Post Error)

System Firmware Progress (Formerly Post Error) – Next Steps

09h

Intel® Quick Path Interface Link Width Reduced

QPI Link Width Reduced Sensor

QPI Link Width Reduced Sensor – Next Steps

12h

Memory RAS Mode Select

Not applicable

83h

System Event

System Events

Not applicable

3.3 BIOS SMI Handler owned Sensors (GID = 0033h)

The following table can be used to find the details of sensors owned by BIOS SMI Handler.

Table 7: BIOS SMI Handler owned Sensors

Sensor Number

Sensor Name

Details Section

Next Steps

01h

Mirroring Redundancy State

Mirroring Redundancy State Sensor – Next Steps

02h

Memory ECC Error

Memory Correctable and Uncorrectable ECC Error

Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps

03h

Legacy PCI Error

Legacy PCI Errors

Legacy PCI Error Sensor – Next Steps

04h

PCI Express* Fatal Error

PCI Express* Fatal Errors and Fatal Error #2

PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps

05h

PCI Express* Correctable Error

PCI Express* Correctable Errors

PCI Express* Correctable Error Sensor – Next Steps

06h

Intel® Quick Path Interface Correctable Error

QPI Correctable Error Sensor

QPI Correctable Error Sensor – Next Steps

07h

Intel® Quick Path Interface Fatal Error

QPI Fatal Error and Fatal Error #2

QPI Fatal Error and Fatal Error #2 – Next Steps

11h

Sparing Redundancy State

Sparing Redundancy State Sensor – Next Steps

13h

Memory Parity Error

Memory Address Parity Error

Memory Address Parity Error Sensor – Next Steps

14h

PCI Express* Fatal Error #2 (continuation of Sensor 04h)

PCI Express* Fatal Errors and Fatal Error #2

PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps

17h

Intel® Quick Path Interface Fatal Error #2 (continuation of Sensor 07h)

QPI Fatal Error and Fatal Error #2

QPI Fatal Error and Fatal Error #2 – Next Steps

83h

System Event

System Events

Not applicable

3.4 Node Manager / ME Firmware owned Sensors (GID = 002Ch or 602Ch)

The following table can be used to find the details of sensors owned by the Node Manager / Management Engine (ME) firmware.

Table 8: Management Engine Firmware owned Sensors

Sensor Number

Sensor Name

Details Section

Next Steps

17h

ME Firmware Health Events

ME Firmware Health Event

ME Firmware Health Event – Next Steps

18h

Node Manager Exception Events

Node Manager Exception Event

Node Manager Exception Event – Next Steps

19h

Node Manager Health Events

Node Manager Health Event

Node Manager Health Event – Next Steps

1Ah

Node Manager Operational Capabilities Change Events

Node Manager Operational Capabilities Change

Node Manager Operational Capabilities Change – Next Steps

1Bh

Node Manager Alert Threshold Exceeded Events

Node Manager Alert Threshold Exceeded

Node Manager Alert Threshold Exceeded – Next Steps

3.5 Microsoft* OS owned Events (GID = 0041h)

The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).

Table 9: Microsoft* OS owned Events

Sensor Name

Record Type

Sensor Type

Details Section

Next Steps

Boot Event

02h

1Fh = OS Boot

Table 98: Boot up Event Record Typical Characteristics

Not applicable

DCh

Not applicable

Table 99: Boot up OEM Event Record Typical Characteristics

Shutdown Event

02h

20h = OS Stop/Shutdown

Table 100: Shutdown Reason Code Event Record Typical Characteristics

Not applicable

DDh

Not applicable

Table 101: Shutdown Reason OEM Event Record Typical Characteristics

Table 102: Shutdown Comment OEM Event Record Typical Characteristics

Not applicable

Bug Check/Blue Screen

02h

20h = OS Stop/Shutdown

Table 103: Bug Check/Blue Screen – OS Stop Event Record Typical Characteristics

Not applicable

DEh

Not applicable

Table 104: Bug Check/Blue Screen code OEM Event Record Typical Characteristics

3.6 Linux* Kernel Panic Events (GID = 0021h)

The following table can be used to find the details of records that can be generated when there is a Linux* Kernel panic.

Table 10: Linux* Kernel Panic Events

Sensor Name

Record Type

Sensor Type

Details Section

Next Steps

Linux* Kernel Panic

02h

20h = OS Stop/Shutdown

Table 105: Linux* Kernel Panic Event Record Characteristics

Not applicable

F0h

Not applicable

Table 106: Linux* Kernel Panic String Extended Record Characteristics
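Sections 3.2 through 3.6 each tie a Generator ID (GID) to the owner of the sensors or events, so the first step in reading a SEL record is usually to find which table applies. A minimal Python sketch of that lookup follows. The names `GENERATOR_OWNERS` and `owner_for_gid` are hypothetical, only the GIDs named in these sections are included, and the sketch assumes that "0041" and "0021" are hexadecimal like the other GIDs.

```python
# Generator IDs from sections 3.2 to 3.6, mapped to (owner, table).
# Assumption: all GID values are hexadecimal.
GENERATOR_OWNERS = {
    0x0001: ("BIOS POST", "Table 6"),
    0x0033: ("BIOS SMI Handler", "Table 7"),
    0x002C: ("Node Manager / ME Firmware", "Table 8"),
    0x602C: ("Node Manager / ME Firmware", "Table 8"),
    0x0041: ("Microsoft OS", "Table 9"),
    0x0021: ("Linux Kernel Panic", "Table 10"),
}


def owner_for_gid(gid: int):
    """Return (owner, cross-reference table) for a GID, or None if the
    GID is not one of those listed in sections 3.2 to 3.6."""
    return GENERATOR_OWNERS.get(gid)
```

A result of None only means the GID is outside these five sections; it does not mean the record is invalid.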

4. Power Subsystems

The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.

4.1 Threshold-based Voltage Sensors

The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI-compliant analog/threshold sensors. Some voltages are present only on specific platforms. For details, check your platform's Technical Product Specification (TPS).

Note: A voltage error can be caused by the device supplying the voltage or by the device using the voltage. For each sensor, the Next Steps table notes which device supplies the voltage and which devices use it.

Table 11: Threshold-based Voltage Sensors Typical Characteristics

Byte

Field

Description

Sensor Type

02h = Voltage

Sensor Number

See Table 13

Event Direction and Event Type

[7] Event direction

0b = Assertion Event

1b = Deassertion Event

[6:0] Event Type = 01h (Threshold)

Event Data 1

[7:6] – 01b = Trigger reading in Event Data 2

[5:4] – 01b = Trigger threshold in Event Data 3

[3:0] – Event Triggers as described in Table 12

Event Data 2

Reading that triggered event

Event Data 3

Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 12: Threshold-based Voltage Sensors Event Triggers – Description

Event Trigger (Hex)

Event Trigger (Description)

Assertion Severity

Deassert Severity

Description

00h

Lower non-critical going low

Degraded

The voltage has dropped below its lower non-critical threshold.

02h

Lower critical going low

non-fatal

Degraded

The voltage has dropped below its lower critical threshold.

07h

Upper non-critical going high

Degraded

The voltage has gone over its upper non-critical threshold.

09h

Upper critical going high

non-fatal

Degraded

The voltage has gone over its upper critical threshold.

Table 13: Threshold-based Voltage Sensors – Next Steps

Sensor Number

Sensor Name

Next Steps

19h

Baseboard +1.05V Processor3 Vccp (BB +1.05Vccp P3)

This 1.05V line is supplied by the main board. This 1.05V line is used by processor 3.

1. Ensure all cables are connected correctly.

2. Check the processor is seated properly.

3. Cross test the processors. If the issue remains with the processor socket, replace the main board, otherwise the processor.

1Ah

Baseboard +1.05V Processor4 Vccp (BB +1.05Vccp P4)

This 1.05V line is supplied by the main board. This 1.05V line is used by processor 4.

1. Ensure all cables are connected correctly.

2. Check the processor is seated properly.

3. Cross test the processors. If the issue remains with the processor socket, replace the main board, otherwise the processor.

D0h

Baseboard +12V (BB +12.0V)

+12V is supplied by the power supplies. +12V is used by SATA drives, fans, and PCI cards. In addition, it is used to generate various processor voltages.

1. Ensure all cables are connected correctly.

2. Check connections on the fans and HDDs.

3. If the issue follows the component, swap it, otherwise, replace the board.

4. If the issue remains, replace the power supplies.

D1h

Baseboard +5V (BB +5.0V)

+5.0V is supplied by the power supplies for pedestal systems, and supplied by the main board on rack-optimized systems.

+5.0V is used by the PCI slots.

1. Ensure all cables are connected correctly.

2. Reseat any PCI cards.

3. Try PCI cards in other PCI slots.

4. If the issue follows the card, swap it, otherwise, replace the main board.

5. If the issue remains, replace the power supplies.

D2h

Baseboard +3.3V (BB +3.3V)

+3.3V is supplied by the power supplies for pedestal systems, and supplied by the main board on rack-optimized systems.

+3.3V is used by the PCIe and PCI-X slots.

1. Ensure all cables are connected correctly.

2. Reseat any PCI cards.

3. Try PCI cards in other PCI slots.

4. If the issue follows the card, swap it, otherwise, replace the main board.

5. If the issue remains, replace the power supplies.

D3h

Baseboard +5V Stand-by (BB +5.0V STBY)

+5.0V STBY is supplied by the power supplies for pedestal systems, and supplied by the main board on rack-optimized systems.

+5.0V STBY is used to generate other standby voltages.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.

D4h

Baseboard +3.3V Auxiliary (BB +3.3V AUX)

+3.3V AUX is supplied by the main board. +3.3V AUX is used by the BMC, clock chips, PCI-E Slot, on-board NIC, Intel® C600 series Chipset, and ICH.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.

D6h

Baseboard +1.05V Processor1 Vccp (BB +1.05Vccp P1)

This 1.05V line is supplied by the main board. This 1.05V line is used by processor 1.

1. Ensure all cables are connected correctly.

2. Check the processor is seated properly.

3. Cross test the processors. If the issue remains with the processor socket, replace the main board, otherwise the processor.

D7h

Baseboard +1.05V Processor2 Vccp (BB +1.05Vccp P2)

This 1.05V line is supplied by the main board. This 1.05V line is used by processor 2.

1. Ensure all cables are connected correctly.

2. Check the processor is seated properly.

3. Cross test the processors. If the issue remains with the processor socket, replace the main board, otherwise the processor.

D8h

Baseboard +1.5V P1 Memory AB VDDQ

(BB +1.5 P1MEM AB)

This 1.5V line is supplied by the main board. This 1.5V line is used by processor 1 memory slots A and B.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

D9h

Baseboard +1.5V P1 Memory CD VDDQ

(BB +1.5 P1MEM CD)

This 1.5V line is supplied by the main board. This 1.5V line is used by processor 1 memory slots C and D.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

DAh

Baseboard +1.5V P2 Memory AB VDDQ

(BB +1.5 P2MEM AB)

This 1.5V line is supplied by the main board. This 1.5V line is used by processor 2 memory slots A and B.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

DBh

Baseboard +1.5V P2 Memory CD VDDQ

(BB +1.5 P2MEM CD)

This 1.5V line is supplied by the main board. This 1.5V line is used by processor 2 memory slots C and D.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

DCh

Baseboard +1.8V Aux (BB +1.8V AUX)

+1.8V AUX is supplied by the main board. +1.8V AUX is used by the BMC and on-board NIC.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.

DDh

Baseboard +1.1V Stand-by (BB +1.1V STBY)

+1.1V STBY is supplied by the main board. +1.1V STBY is used by the Intel® C600 series Chipset.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.

DEh

Baseboard CMOS Battery (BB +3.3V Vbat)

+3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on. +3.3V Vbat is used by the CMOS and related circuits.

1. Replace the CMOS battery. Any battery of type CR2032 can be used.

2. If error remains (unlikely), replace the board.

E4h

Baseboard +1.35V P1 Low Voltage Memory AB VDDQ

(BB +1.35 P1LV AB)

This 1.35V line is supplied by the main board. This 1.35V line is used by processor 1 memory slots A and B.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

E5h

Baseboard +1.35V P1 Low Voltage Memory CD VDDQ

(BB +1.35 P1LV CD)

This 1.35V line is supplied by the main board. This 1.35V line is used by processor 1 memory slots C and D.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

E6h

Baseboard +1.35V P2 Low Voltage Memory AB VDDQ

(BB +1.35 P2LV AB)

This 1.35V line is supplied by the main board. This 1.35V line is used by processor 2 memory slots A and B.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

E7h

Baseboard +1.35V P2 Low Voltage Memory CD VDDQ

(BB +1.35 P2LV CD)

This 1.35V line is supplied by the main board. This 1.35V line is used by processor 2 memory slots C and D.

1. Ensure all cables are connected correctly.

2. Check the DIMMs are seated properly.

3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board, otherwise the DIMM.

EAh

Baseboard +3.3V Riser 1 Power Good (BB +3.3 RSR1 PGD)

+3.3V Riser 1 Power Good is supplied by Riser 1 on specific platforms. +3.3V Riser 1 Power Good is an indication of the +3.3V on Riser 1.

1. Ensure that the riser is seated correctly.

2. If issue remains, replace the riser.

3. If issue remains, replace the main board.

4. If the issue remains, replace the power supplies.

EBh

Baseboard +3.3V Riser 2 Power Good (BB +3.3 RSR2 PGD)

+3.3V Riser 2 Power Good is supplied by Riser 2 on specific platforms. +3.3V Riser 2 Power Good is an indication of the +3.3V on Riser 2.

1. Ensure that the riser is seated correctly.

2. If issue remains, replace the riser.

3. If issue remains, replace the main board.

4. If the issue remains, replace the power supplies.

ECh

Baseboard +0.9V (BB 0.9V Core IB)

+0.9V Core IB is supplied by the main board on specific platforms. +0.9V Core IB is used by the on-board Infiniband* controller on those specific platforms.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.

EDh

Baseboard +1.8V (BB 1.8V IB I/O)

+1.8V IB I/O is supplied by the main board on specific platforms. +1.8V IB I/O is used by the on-board Infiniband* controller on those specific platforms.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.

EEh

Baseboard +1.1V (BB 1.1V PCH)

This 1.1V line is supplied by the main board. This 1.1V line is used by the Intel® C600 series Chipset.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

EFh

Baseboard +1.2V (BB +1.2V IB)

+1.2V is supplied by the main board on specific platforms. +1.2V is used by the on-board Infiniband* controller on those specific platforms.

1. Ensure all cables are connected correctly.

2. If the issue remains, replace the board.

3. If the issue remains, replace the power supplies.
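As a rough illustration of the record layout in Table 11 and the event triggers in Table 12, the following Python sketch decodes the event bytes of a threshold-based voltage event. It is a sketch under stated assumptions, not a reference decoder: the names `TRIGGERS` and `decode_voltage_event` are hypothetical, the caller is assumed to have already extracted the four bytes from the SEL record, and only the assertion severities as read from Table 12 are included. The reading and threshold are returned as raw byte values; converting them to volts needs sensor-specific conversion factors that this section does not cover.

```python
# Event trigger offsets from Table 12: offset -> (name, assertion severity).
TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded"),
    0x02: ("Lower critical going low", "non-fatal"),
    0x07: ("Upper non-critical going high", "Degraded"),
    0x09: ("Upper critical going high", "non-fatal"),
}


def decode_voltage_event(dir_type: int, data1: int, data2: int, data3: int) -> dict:
    """Decode Event Dir/Type and Event Data 1-3 for a voltage sensor."""
    if dir_type & 0x7F != 0x01:          # bits [6:0]: 01h = Threshold
        raise ValueError("not a threshold event")
    name, severity = TRIGGERS.get(data1 & 0x0F, ("unknown trigger", None))
    return {
        # bit [7]: 0b = assertion, 1b = deassertion
        "direction": "deassertion" if dir_type & 0x80 else "assertion",
        "trigger": name,
        "assertion_severity": severity,
        # bits [7:6] = 01b: Event Data 2 holds the trigger reading
        "reading_raw": data2 if (data1 >> 6) & 0x3 == 0x1 else None,
        # bits [5:4] = 01b: Event Data 3 holds the trigger threshold
        "threshold_raw": data3 if (data1 >> 4) & 0x3 == 0x1 else None,
    }
```

For example, Event Data 1 = 52h decodes to a "Lower critical going low" trigger with both the reading and the threshold present in Event Data 2 and Event Data 3.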

4.2 Voltage Regulator Watchdog Timer Sensor

The BMC FW monitors that the power sequence for the board VR controllers is completed when a DC power-on is initiated. Failure to complete the sequence indicates a board problem, in which case the FW powers down the system.

The sequence is as follows:

• BMC FW monitors the PowerSupplyPowerGood signal for assertion, indicating a DC-power-on has been initiated, and starts a timer (VR Watchdog Timer). For EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600 Product Families this timeout is 500ms.

• If the SystemPowerGood signal has not asserted by the time the VR Watchdog Timer expires, the FW powers down the system, logs a SEL entry, and emits a beep code (1-5-1-2). This failure is termed a VR Watchdog Timeout.

Table 14: Voltage Regulator Watchdog Timer Sensor Typical Characteristics

Byte

Field

Description

Sensor Type

02h = Voltage

Sensor Number

0Bh

Event Direction and Event Type

[7] Event direction

0b = Assertion Event

1b = Deassertion Event

[6:0] Event Type = 03h (“digital” Discrete)

Event Data 1

[7:6] – 00b = Unspecified Event Data 2

[5:4] – 00b = Unspecified Event Data 3

[3:0] – Event Trigger Offset = 1h = State Asserted

Event Data 2

Not used

Event Data 3

Not used

4.2.1 Voltage Regulator Watchdog Timer Sensor – Next Steps

1. Ensure that all the connectors from the power supply are well seated.

2. Cross test the baseboard. If the issue remains with the baseboard, replace the baseboard.
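The VR watchdog sequence described in section 4.2 can be sketched as a small Python model. This is not BMC firmware code: the function name `vr_watchdog_outcome` and its parameter are hypothetical, and the sketch assumes that a SystemPowerGood assertion at exactly the timeout still counts as in time, which the text does not specify.

```python
from typing import List, Optional

VR_WATCHDOG_TIMEOUT_MS = 500  # timeout given in section 4.2


def vr_watchdog_outcome(system_power_good_at_ms: Optional[int],
                        timeout_ms: int = VR_WATCHDOG_TIMEOUT_MS) -> List[str]:
    """Model the FW reaction after PowerSupplyPowerGood asserts.

    system_power_good_at_ms is the time, measured from the start of the
    VR Watchdog Timer, at which SystemPowerGood asserts, or None if it
    never asserts.
    """
    if system_power_good_at_ms is not None and system_power_good_at_ms <= timeout_ms:
        return ["continue power-on"]
    # VR Watchdog Timeout: the three actions listed in section 4.2.
    return ["power down system", "log SEL entry", "emit beep code 1-5-1-2"]
```

A SEL entry for sensor 0Bh therefore corresponds to the second branch: SystemPowerGood did not assert before the timer expired.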

4.3 Power Unit

The power unit monitors the power state of the system and logs the state changes in the SEL.

4.3.1 Power Unit Status Sensor

The power unit status sensor monitors the power state of the system and logs state changes. Expected events, such as DC ON/OFF, are logged, as are unexpected events, such as AC loss and power good loss.

Table 15: Power Unit Status Sensors Typical Characteristics

Byte

Field

Description

Sensor Type

09h = Power Unit

Sensor Number

01h

Event Direction and Event Type

[7] Event direction

0b = Assertion Event

1b = Deassertion Event

[6:0] Event Type = 6Fh (Sensor Specific)

Event Data 1

[7:6] – 00b = Unspecified Event Data 2

[5:4] – 00b = Unspecified Event Data 3

[3:0] – Sensor Specific offset as described in Table 16

Event Data 2

Not used

Event Data 3

Not used

Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps

Sensor Specific Offset (Hex)

Sensor Specific Offset (Description)

Description

Next Steps

00h

Power down

System is powered down.

Informational Event

02h

240 VA power down

240 VA power limit was exceeded and the hardware forced a power down.

This could have been caused by many things.

1. If you recently added hardware, try removing it.

2. Remove/replace any add-in adapters.

3. Remove/replace the power supply.

4. Remove/replace the processors, DIMM, and/or hard drives.

5. Remove/replace the boards in the system.

04h

A/C Lost

A/C power was removed.

Informational Event

05h

Soft Power Control Failure

Generally means power good was lost in the system, causing a shutdown.

This could be cause by the power supply subsystem or system components.

1. Verify all power cables and adapters are connected properly (AC cables as well as the cables between the PSU and system components).

2. Cross test the PSU if possible.

3. Replace the power subsystem.

06h

Power Unit Failure

Power subsystem experienced a failure.

Indicates a power supply failed.

1. Remove and reapply AC power.

2. If the power supply still fails, replace it.
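The sensor-specific offsets for the Power Unit Status sensor lend themselves to a small lookup. The sketch below is illustrative only: the dictionary and the function name describe_power_unit_status are hypothetical, and the informational flag simply mirrors the entries whose Next Steps read "Informational Event".

```python
# Offset -> (name, True if the table marks it as an informational event).
POWER_UNIT_STATUS_OFFSETS = {
    0x00: ("Power down", True),
    0x02: ("240 VA power down", False),
    0x04: ("A/C lost", True),
    0x05: ("Soft power control failure", False),
    0x06: ("Power unit failure", False),
}


def describe_power_unit_status(event_data1: int) -> str:
    """Describe a Power Unit Status event from its Event Data 1 byte."""
    offset = event_data1 & 0x0F  # bits [3:0] carry the sensor-specific offset
    entry = POWER_UNIT_STATUS_OFFSETS.get(offset)
    if entry is None:
        return f"offset {offset:02X}h: not listed in the table"
    name, informational = entry
    kind = "informational" if informational else "needs troubleshooting"
    return f"offset {offset:02X}h: {name} ({kind})"
```

For example, describe_power_unit_status(0x05) reports a soft power control failure that needs troubleshooting, while an offset that the table does not list is reported as such rather than guessed at.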

4.3.2 Power Unit Redundancy Sensor

This sensor is enabled on systems that support redundant power supplies. When AC is applied to a system, or when the system loses power supply redundancy, a message is logged in the SEL.

Table 17: Power Unit Redundancy Sensors Typical Characteristics

Field                           Description
Sensor Type                     09h = Power Unit
Sensor Number                   02h
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 0Bh (Generic Discrete)
Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                [5:4] – 00b = Unspecified Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 18
Event Data 2                    Not used
Event Data 3                    Not used

Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps

00h – Fully redundant
    Description: System is fully operational.
    Next Steps: Informational Event

01h – Redundancy lost
    Description: System is not running in redundant power supply mode.
    Next Steps: This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on). Troubleshoot these events accordingly.

02h – Redundancy degraded
03h – Non-redundant, sufficient from redundant
04h – Non-redundant, sufficient from insufficient
05h – Non-redundant, insufficient
06h – Non-redundant, degraded from fully redundant
07h – Redundant, degraded from non-redundant

4.3.3 Node Auto Shutdown Sensor

The BMC supports a Node Auto Shutdown sensor that logs a SEL event when a node is shut down in an emergency, either because power supply redundancy was lost or because of PSU CLST throttling triggered by an over-current warning condition. This sensor is applicable only to multinode systems.

The sensor is rearmed on power-on (AC or DC power-on transitions). This sensor is used only to trigger SEL entries that indicate assertion or deassertion of a node or power auto shutdown.

Table 19: Node Auto Shutdown Sensor Typical Characteristics

Field                           Description
Sensor Type                     09h = Power Unit
Sensor Number                   B8h
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 03h (“digital” discrete)
Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                [5:4] – 00b = Unspecified Event Data 3
                                [3:0] – Event Trigger Offset: 1h = State Asserted
Event Data 2                    Not used
Event Data 3                    Not used

4.3.3.1 Node Auto Shutdown Sensor – Next Steps

This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on) or other system events. Troubleshoot these events accordingly.

4.4 Power Supply

The BMC monitors the power supply subsystem.

4.4.1 Power Supply Status Sensors

These sensors report the status of the power supplies in the system. An event can be logged when AC is first applied to the system or removed from it. An event can also be logged when there is a failure, a predictive failure, or a configuration error.

Table 20: Power Supply Status Sensors Typical Characteristics

Field                           Description
Sensor Type                     08h = Power Supply
Sensor Number                   50h = Power Supply 1 Status
                                51h = Power Supply 2 Status
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 6Fh (Sensor Specific)
Event Data 1                    [7:6] – ED2 data in Table 21
                                [5:4] – ED3 data in Table 21
                                [3:0] – Sensor Specific offset as described in Table 21
Event Data 2                    As described in Table 21
Event Data 3                    As described in Table 21

Table 21: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps

00h – Presence
    Description: Power supply detected.
    ED2: 00b = Unspecified Event Data 2
    ED3: 00b = Unspecified Event Data 3
    Next Steps: Informational Event

01h – Failure
    Description: Power supply failed. Check the data in ED2 and ED3 for more details.
    ED2: 10b = OEM code in Event Data 2
        • 01h – Output voltage fault
        • 02h – Output power fault
        • 03h – Output over-current fault
        • 04h – Over-temperature fault
        • 05h – Fan fault
    ED3: 10b = OEM code in Event Data 3
        Will have the contents of the associated PMBus* Status register. For example, Data 3 will have the contents of the VOLTAGE_STATUS register at the time an Output Voltage fault was detected. Refer to the PMBus* Specification for details on specific register contents.
    Next Steps: Indicates a power supply failed.
        1. Remove and reapply AC.
        2. If the power supply still fails, replace it.

02h – Predictive Failure
    Description: Check the data in ED2 and ED3 for more details.
    ED2: 10b = OEM code in Event Data 2
        • 01h – Output voltage warning
        • 02h – Output power warning
        • 03h – Output over-current warning
        • 04h – Over-temperature warning
        • 05h – Fan warning
        • 06h – Input under-voltage warning
        • 07h – Input over-current warning
        • 08h – Input over-power warning
    ED3: 10b = OEM code in Event Data 3
        Will have the contents of the associated PMBus* Status register. For example, Data 3 will have the contents of the VOLTAGE_STATUS register at the time an Output Voltage warning was detected. Refer to the PMBus* Specification for details on specific register contents.
    Next Steps: Depends on the warning event.
        1. Replace the power supply.
        2. Verify proper airflow to the system.
        3. Verify the power source.
        4. Replace the system boards.

03h – A/C lost
    Description: AC removed.
    ED2: 00b = Unspecified Event Data 2
    ED3: 00b = Unspecified Event Data 3
    Next Steps: Informational Event.

06h – Configuration error
    Description: Power supply configuration is not supported. Check the data in ED2 for more details.
    ED2: 10b = OEM code in Event Data 2
        • 01h – The BMC cannot access the PMBus* device on the PSU but its FRU device is responding.
        • 02h – The PMBUS*_REVISION command returns a version number that is not supported (only version 1.1 and 1.2 are supported).
        • 03h – The PMBus* device does not successfully respond to the PMBUS*_REVISION command.
        • 04h – The PSU is incompatible with one or more PSUs that are present in the system.
        • 05h – The PSU FW is operating in a degraded mode (likely due to a failed firmware update).
    ED3: 00b = Unspecified Event Data 3
    Next Steps: Indicates that at least one of the supplies is not correct for your system configuration.
        1. Remove the power supply and verify compatibility.
        2. If the power supply is compatible, it may be faulty. Replace it.
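For Power Supply Status events, the meaning of Event Data 2 depends on the sensor-specific offset in Event Data 1. The sketch below shows one way to combine the two bytes; it is not the BMC's implementation. The names are hypothetical, the code strings are shortened paraphrases of the table entries, and Event Data 3 (the PMBus* status register contents) is deliberately left undecoded.

```python
FAILURE_CODES = {
    0x01: "Output voltage fault",
    0x02: "Output power fault",
    0x03: "Output over-current fault",
    0x04: "Over-temperature fault",
    0x05: "Fan fault",
}

PREDICTIVE_CODES = {
    0x01: "Output voltage warning",
    0x02: "Output power warning",
    0x03: "Output over-current warning",
    0x04: "Over-temperature warning",
    0x05: "Fan warning",
    0x06: "Input under-voltage warning",
    0x07: "Input over-current warning",
    0x08: "Input over-power warning",
}

CONFIG_ERROR_CODES = {
    0x01: "PMBus device not accessible, FRU device responding",
    0x02: "Unsupported PMBus revision",
    0x03: "No valid response to the PMBus revision command",
    0x04: "PSU incompatible with another installed PSU",
    0x05: "PSU FW in degraded mode",
}

# Offset -> (name, ED2 code table or None when ED2 is unspecified).
PSU_STATUS_OFFSETS = {
    0x00: ("Presence", None),
    0x01: ("Failure", FAILURE_CODES),
    0x02: ("Predictive failure", PREDICTIVE_CODES),
    0x03: ("A/C lost", None),
    0x06: ("Configuration error", CONFIG_ERROR_CODES),
}


def describe_psu_status(event_data1: int, event_data2: int) -> str:
    """Combine the ED1 offset with the OEM code in ED2, when present."""
    offset = event_data1 & 0x0F
    if offset not in PSU_STATUS_OFFSETS:
        return f"offset {offset:02X}h: not listed in the table"
    name, ed2_codes = PSU_STATUS_OFFSETS[offset]
    ed2_is_oem = ((event_data1 >> 6) & 0x03) == 0b10  # ED1 bits [7:6]
    if ed2_codes is None or not ed2_is_oem:
        return name
    detail = ed2_codes.get(event_data2, f"unlisted ED2 code {event_data2:02X}h")
    return f"{name}: {detail}"
```

Checking bits [7:6] before trusting Event Data 2 is a deliberate choice here: when they read 00b the byte is unspecified, so the sketch reports only the offset name.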


4.4.2 Power Supply Power In Sensors

These sensors log an event when a power supply in the system exceeds its AC power-in threshold.

Table 22: Power Supply Power In Sensors Typical Characteristics

Field                           Description
Sensor Type                     0Bh = Other Units
Sensor Number                   54h = Power Supply 1 Status
                                55h = Power Supply 2 Status
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 01h (Threshold)
Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                [5:4] – 01b = Trigger threshold in Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 23
Event Data 2                    Reading that triggered event
Event Data 3                    Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps

07h – Upper non-critical going high
    Assertion Severity: Degraded

09h – Upper critical going high
    Assertion Severity: non-fatal
    Deassert Severity: Degraded

Description: PMBus* feature to monitor power supply power consumption.

Next Steps: If you see this event, the system is pulling too much power on the input for the PSU rating.
    1. Verify the power budget is within the specified range.
    2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.
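For threshold events such as these, Event Data 1 also states whether Event Data 2 and Event Data 3 carry the trigger reading and the threshold. The following sketch unpacks the three bytes under that layout. The names ThresholdEvent and decode_threshold_event are hypothetical, and the values returned are raw sensor counts: converting them to engineering units requires the conversion factors from the sensor's SDR, which this sketch does not attempt.

```python
from typing import NamedTuple, Optional

# Upper-threshold offsets listed for the Power Supply Power In sensors.
UPPER_THRESHOLD_OFFSETS = {
    0x07: "Upper non-critical going high",
    0x09: "Upper critical going high",
}


class ThresholdEvent(NamedTuple):
    offset_name: str
    raw_reading: Optional[int]    # Event Data 2, if ED1[7:6] == 01b
    raw_threshold: Optional[int]  # Event Data 3, if ED1[5:4] == 01b


def decode_threshold_event(ed1: int, ed2: int, ed3: int) -> ThresholdEvent:
    """Unpack Event Data 1 to 3 of a threshold (Event Type 01h) event."""
    offset = ed1 & 0x0F
    name = UPPER_THRESHOLD_OFFSETS.get(offset, f"offset {offset:02X}h (not listed)")
    reading = ed2 if ((ed1 >> 6) & 0x03) == 0b01 else None
    threshold = ed3 if ((ed1 >> 4) & 0x03) == 0b01 else None
    return ThresholdEvent(name, reading, threshold)
```

A record with Event Data bytes 57h, C8h, C0h therefore decodes to an upper non-critical event with a raw reading of 200 against a raw threshold of 192.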


4.4.3 Power Supply Current Out % Sensors

PMBus*-compliant power supplies may monitor the current output of the main 12 V voltage rail and report the current usage as a percentage of the maximum power output for that rail.

Table 24: Power Supply Current Out % Sensors Typical Characteristics

Field                           Description
Sensor Type                     03h = Current
Sensor Number                   58h = Power Supply 1 Current Out %
                                59h = Power Supply 2 Current Out %
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 01h (Threshold)
Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                [5:4] – 01b = Trigger threshold in Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 25
Event Data 2                    Reading that triggered event
Event Data 3                    Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps

07h – Upper non-critical going high
    Assertion Severity: Degraded

09h – Upper critical going high
    Assertion Severity: non-fatal
    Deassert Severity: Degraded

Description: PMBus* feature to monitor power supply power consumption.

Next Steps: If you see this event, the system is using too much power on the output for the PSU rating.
    1. Verify the power budget is within the specified range.
    2. Check http://www.intel.com/p/en_US/support/ for the power budget tool for your system.


4.4.4 Power Supply Temperature Sensors

The BMC monitors one or two power supply temperature sensors for each installed PMBus*-compliant power supply.

Table 26: Power Supply Temperature Sensors Typical Characteristics

Field                           Description
Sensor Type                     01h = Temperature
Sensor Number                   5Ch = Power Supply 1 Temperature
                                5Dh = Power Supply 2 Temperature
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 01h (Threshold)
Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                [5:4] – 01b = Trigger threshold in Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 27
Event Data 2                    Reading that triggered event
Event Data 3                    Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps

07h – Upper non-critical going high
    Assertion Severity: Degraded

09h – Upper critical going high
    Assertion Severity: non-fatal
    Deassert Severity: Degraded

Description: An upper non-critical or critical temperature threshold has been crossed.

Next Steps:
    1. Check for clear and unobstructed airflow into and out of the chassis.
    2. Ensure the SDR is programmed and the correct chassis has been selected.
    3. Ensure there are no fan failures.
    4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).


4.4.5 Power Supply Fan Tachometer Sensors

The BMC polls each installed power supply using the PMBus* fan status commands to check for failure conditions for the power supply fans.

Table 28: Power Supply Fan Tachometer Sensors Typical Characteristics

Field                           Description
Sensor Type                     04h = Fan
Sensor Number                   A0h = Power Supply 1 Fan Tachometer 1
                                A1h = Power Supply 1 Fan Tachometer 2
                                A4h = Power Supply 2 Fan Tachometer 1
                                A5h = Power Supply 2 Fan Tachometer 2
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 03h (“digital” Discrete)
Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                [5:4] – 00b = Unspecified Event Data 3
                                [3:0] – Event Trigger Offset = 1h = State Asserted
Event Data 2                    Not used
Event Data 3                    Not used

4.4.5.1 Power Supply Fan Tachometer Sensors – Next Steps

These events are generated only in systems with PMBus*-capable power supplies, normally when airflow to the power supply is obstructed:

1. Remove and then reinstall the power supply to see whether something might have temporarily caused the fan failure.

2. Swap the power supply with another one to see whether the problem stays with the location or follows the power supply.

3. Replace the power supply depending on the outcome of steps 1 and 2.

4. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.


5. Cooling Subsystem

5.1 Fan Sensors

There are three types of fan sensors that can be present on Intel® Server Systems: speed, presence, and redundancy. The last two are only present in the systems with hot-swap redundant fans.

5.1.1 Fan Tachometer Sensors

Fan tachometer sensors monitor the rpm signal on the relevant fan headers on the platform. Fan speed sensors are threshold-based sensors. Usually they only have lower (critical) thresholds set, so that a SEL entry is only generated if the fan spins too slowly.

Table 29: Fan Tachometer Sensors Typical Characteristics

Field                           Description
Sensor Type                     04h = Fan
Sensor Number                   30h-3Fh (Chassis specific)
                                BAh-BFh (Chassis specific)
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 01h (Threshold)
Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                [5:4] – 01b = Trigger threshold in Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 30
Event Data 2                    Reading that triggered event
Event Data 3                    Threshold value that triggered event

The following table describes the severity of each of the event triggers for both assertion and deassertion.


Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps

00h – Lower non-critical going low
    Assertion Severity: Degraded
    Description: The fan speed has dropped below its lower non-critical threshold.

02h – Lower critical going low
    Assertion Severity: non-fatal
    Deassert Severity: Degraded
    Description: The fan speed has dropped below its lower critical threshold.

Next Steps: A fan speed error on a new system build is typically not caused by the fan spinning too slowly; instead, it is caused by the fan being connected to the wrong header (the BMC expects fans on certain headers for each chassis and logs this event if there is no fan on that header).
    1. Refer to the Quick Start Guide or the Service Guide to identify the correct fan headers to use.
    2. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.
    3. If you are sure this was done, the event may be a sign of impending fan failure (although this only normally applies if the system has been in use for a while). Replace the fan.

5.1.2 Fan Presence and Redundancy Sensors

Fan presence sensors are only implemented for hot-swap fans, and require an additional pin on the fan header. Fan redundancy is an aggregate of the fan presence sensors and will warn when redundancy is lost. Typically the redundancy mode on Intel® servers is n+1 redundancy (if one fan fails there are still sufficient fans to cool the system, but it is no longer redundant), although other modes are also possible.

Table 31: Fan Presence Sensors Typical Characteristics

Field                           Description
Sensor Type                     04h = Fan
Sensor Number                   40h-4Fh (Chassis specific)
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 08h (Generic “digital” Discrete)
Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                [5:4] – 00b = Unspecified Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 32
Event Data 2                    Not used
Event Data 3                    Not used

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps

01h – Device Present
    Deassert Severity: Degraded
    Description (Assertion): A fan was inserted. This event may also get logged when the BMC initializes when AC is applied.
    Next Steps (Assertion): Informational only
    Description (Deassert): A fan was removed, or was not present at the expected location when the BMC initialized.
    Next Steps (Deassert): These events are generated only in systems with hot-swappable fans, and normally only when a fan is physically inserted or removed. If fans were not physically removed:
        1. Use the Quick Start Guide to check whether the right fan headers were used.
        2. Swap the fans around to see whether the problem stays with the location or follows the fan.
        3. Replace the fan or fan wiring/housing depending on the outcome of step 2.
        4. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.

Table 33: Fan Redundancy Sensors Typical Characteristics

Field                           Description
Sensor Type                     04h = Fan
Sensor Number                   0Ch
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 0Bh (Generic Discrete)
Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                [5:4] – 00b = Unspecified Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 34
Event Data 2                    Not used
Event Data 3                    Not used

The following table describes the severity of each of the event triggers for both assertion and deassertion.

Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps

00h – Fully redundant
01h – Redundancy lost
02h – Redundancy degraded
03h – Non-redundant, sufficient from redundant
04h – Non-redundant, sufficient from insufficient
    Description: The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.

05h – Non-redundant, insufficient
    Description: The system has lost fans and may no longer be able to cool itself adequately. Overheating may occur if this situation remains for a longer period of time.

06h – Non-redundant, degraded from fully redundant
    Description: The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system properly cooled, but fan speeds will boost.

07h – Redundant, degraded from non-redundant
    Description: The system has lost one or more fans and is running in a degraded mode, but still is redundant. There are enough fans to keep the system properly cooled.

Next Steps: Fan redundancy loss indicates failure of one or more fans. Look for lower (non-) critical fan errors, or fan removal errors in the SEL, to indicate which fan is causing the problem, and follow the troubleshooting steps for these event types.
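The fan redundancy offsets can be triaged with a short lookup that separates the one state in which cooling may be inadequate from the states in which the system is still adequately cooled. This is a sketch under stated assumptions: the function name summarize_fan_redundancy and the three action strings are hypothetical, and treating offset 00h as needing no action is an assumption of this example rather than something the table states.

```python
FAN_REDUNDANCY_OFFSETS = {
    0x00: "Fully redundant",
    0x01: "Redundancy lost",
    0x02: "Redundancy degraded",
    0x03: "Non-redundant, sufficient from redundant",
    0x04: "Non-redundant, sufficient from insufficient",
    0x05: "Non-redundant, insufficient",
    0x06: "Non-redundant, degraded from fully redundant",
    0x07: "Redundant, degraded from non-redundant",
}


def summarize_fan_redundancy(event_data1: int) -> tuple:
    """Return (state name, suggested action) for a fan redundancy event."""
    offset = event_data1 & 0x0F  # bits [3:0] carry the event trigger offset
    name = FAN_REDUNDANCY_OFFSETS.get(offset)
    if name is None:
        return (f"offset {offset:02X}h", "not listed in the table")
    if offset == 0x00:
        action = "no action"  # assumption: fully redundant needs no follow-up
    elif offset == 0x05:
        action = "urgent: cooling may be inadequate, find the failed or removed fan"
    else:
        action = "look for fan threshold or fan removal events in the SEL"
    return (name, action)
```

Only offset 05h is marked urgent because it is the only entry whose description warns that the system may no longer be able to cool itself.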


5.2 Temperature Sensors

There are a variety of temperature sensors that can be implemented on Intel® Server Systems. They are split into various types, each with their own events that can be logged.

• Threshold-based Temperature
• Thermal Margin
• Processor Thermal Control %
• Processor DTS Thermal Margin (Monitor only)
• Discrete Thermal
• DIMM Thermal Trip

5.2.1 Threshold-based Temperature Sensors

Threshold-based temperature sensors are sensors that report an actual temperature. These are linear, threshold-based sensors. In most Intel® Server Systems, multiple sensors are defined: front panel temperature and baseboard temperature. There are also multiple other sensors that can be defined and are platform-specific. Most of these sensors typically have upper and lower thresholds set – upper to warn in case of an over-temperature situation, lower to warn against sensor failure (temperature sensors typically read out 0 if they stop working).

Table 35: Temperature Sensors Typical Characteristics

Field                           Description
Sensor Type                     01h = Temperature
Sensor Number                   See Table 37
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 01h (Threshold)
Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                [5:4] – 01b = Trigger threshold in Event Data 3
                                [3:0] – Event Trigger Offset as described in Table 36
Event Data 2                    Reading that triggered event
Event Data 3                    Threshold value that triggered event

Table 36: Temperature Sensors Event Triggers – Description

00h – Lower non-critical going low
    Assertion Severity: Degraded
    Description: The temperature has dropped below its lower non-critical threshold.

02h – Lower critical going low
    Assertion Severity: non-fatal
    Deassert Severity: Degraded
    Description: The temperature has dropped below its lower critical threshold.

07h – Upper non-critical going high
    Assertion Severity: Degraded
    Description: The temperature has gone over its upper non-critical threshold.

09h – Upper critical going high
    Assertion Severity: non-fatal
    Deassert Severity: Degraded
    Description: The temperature has gone over its upper critical threshold.

Table 37: Temperature Sensors – Next Steps

21h – Front Panel Temp
    Next Steps: If the front panel temperature reads zero, check:
        1. It is connected properly.
        2. The SDR has been programmed correctly for your chassis.
    If the front panel temperature is too high:
        1. Check the cooling of your server room.

14h – Baseboard Temperature 5
15h – Baseboard Temperature 6
16h – I/O Mod2 Temp
17h – PCI Riser 5 Temp
18h – PCI Riser 4 Temp
20h – Baseboard Temperature 1
22h – SSB Temperature
23h – Baseboard Temperature 2
24h – Baseboard Temperature 3
25h – Baseboard Temperature 4
26h – I/O Mod Temp
27h – PCI Riser 1 Temp
28h – IO Riser Temp
2Ch – PCI Riser 2 Temp
2Dh – SAS Mod Temp
2Eh – Exit Air Temp
2Fh – LAN NIC Temp
    Next Steps:
        1. Check for clear and unobstructed airflow into and out of the chassis.
        2. Ensure the SDR is programmed and correct chassis has been selected.
        3. Ensure there are no fan failures.
        4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).

5.2.2 Thermal Margin Sensors

Margin sensors are also linear sensors but typically report a negative value. This is not an actual temperature, but an offset to a critical temperature. The values reported are read as the number of degrees below a critical temperature for the particular component.

The BMC supports DIMM aggregate temperature margin IPMI sensors. The temperature readings from the physical temperature sensors on each DIMM (such as the Temperature Sensor on DIMM, or TSOD) are aggregated into IPMI temperature margin sensors for groupings of DIMM slots. The partitioning of these groupings is platform/SKU specific and generally corresponds to fan domains.

The BMC supports global aggregate temperature margin IPMI sensors. There may be as many unique global aggregate sensors as there are fan domains. Each sensor aggregates the readings of multiple other IPMI temperature sensors supported by the BMC FW. The mapping of child-sensors into each global aggregate sensor is SDR-configurable. The primary usage for these sensors is to trigger turning off fans when a lower threshold is reached.

Table 38: Thermal Margin Sensors Typical Characteristics

Field                           Description
Sensor Type                     01h = Temperature
Sensor Number                   See Table 40
Event Direction and Event Type  [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event
                                [6:0] Event Type = 01h (Threshold)
Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                [5:4] – 01b = Trigger threshold in Event Data 3
                                [3:0] – Event Triggers as described in Table 39
Event Data 2                    Reading that triggered event
Event Data 3                    Threshold value that triggered event

Table 39: Thermal Margin Sensors Event Triggers – Description

07h – Upper non-critical going high
    Assertion Severity: Degraded
    Description: The thermal margin has gone over its upper non-critical threshold.

09h – Upper critical going high
    Assertion Severity: non-fatal
    Deassert Severity: Degraded
    Description: The thermal margin has gone over its upper critical threshold.

Table 40: Thermal Margin Sensors – Next Steps

74h – P1 Therm Margin
75h – P2 Therm Margin
76h – P3 Therm Margin
77h – P4 Therm Margin
    Next Steps: Not a logged SEL event. Sensor is used for thermal management of the processor.

B0h – P1 DIMM Thrm Mrgn1
B1h – P1 DIMM Thrm Mrgn2
B2h – P2 DIMM Thrm Mrgn1
B3h – P2 DIMM Thrm Mrgn2
B4h – P3 DIMM Thrm Mrgn1
B5h – P3 DIMM Thrm Mrgn2
B6h – P4 DIMM Thrm Mrgn1
B7h – P4 DIMM Thrm Mrgn2
C8h – Agg Therm Mrgn 1
C9h – Agg Therm Mrgn 2
CAh – Agg Therm Mrgn 3
CBh – Agg Therm Mrgn 4
CCh – Agg Therm Mrgn 5
CDh – Agg Therm Mrgn 6
CEh – Agg Therm Mrgn 7
CFh – Agg Therm Mrgn 8
    Next Steps:
        1. Check for clear and unobstructed airflow into and out of the chassis.
        2. Ensure the SDR is programmed and correct chassis has been selected.
        3. Ensure there are no fan failures.
        4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
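Because a thermal margin is reported as a negative offset from a critical temperature, a raw event byte has to be read as a signed number before it means anything. The sketch below assumes, purely for illustration, that the raw byte is an 8-bit two's complement value with one count per degree; the real encoding and scaling are defined by the sensor's SDR, and the function names are hypothetical.

```python
def to_signed8(raw: int) -> int:
    """Interpret a raw byte (0..255) as an 8-bit two's complement value."""
    if not 0 <= raw <= 0xFF:
        raise ValueError("raw value must fit in one byte")
    return raw - 0x100 if raw & 0x80 else raw


def degrees_below_critical(raw_margin: int) -> int:
    """Number of degrees the component sits below its critical temperature.

    Assumes one count per degree (an assumption of this sketch). A positive
    result means headroom remains; zero or a negative result means the
    critical temperature has been reached or exceeded.
    """
    return -to_signed8(raw_margin)
```

Under these assumptions a raw margin of ECh reads as -20, that is, 20 degrees of headroom, while a raw margin of 05h would indicate a component 5 degrees over its critical temperature.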

Byte

Field

Description

Sensor Type

01h = Temperature

Sensor Number

78h = Processor 1 Thermal Control % 79h = Processor 2 Thermal Control %

Cooling Subsystem

5.2.3 Processor Thermal Control Sensors

The BMC FW monitors the percentage of time that a processor has been operationally constrained over a given time window (nominally six seconds) due to internal thermal management algorithms engaging to reduce the temperature of the device. This monitoring is instantiated as one IPMI analog/threshold sensor per processor package.

If this is not addressed, the processor will overheat and shut down the system to protect itself from damage.

Table 41: Processor Thermal Control Sensors Typical Characteristics

Revision 1.1 Intel order number G90620-002 53

Page 64

Cooling Subsystem

Byte

Field

Description

7Ah = Processor 3 Thermal Control % 7Bh = Processor 4 Thermal Control %

Event Direction and Event Type

[7] Event direction

0b = Assertion Event 1b = Deassertion Event

[6:0] Event Type = 01h (Threshold)

Event Data 1

[7:6] – 01b = Trigger reading in Event Data 2 [5:4] – 01b = Trigger threshold in Event Data 3 [3:0] – Event Triggers as described in Table 42

Event Data 2

Reading that triggered event

Event Data 3

Threshold value that triggered event

Event Trigger

Assertion

Severity

Deassert

Severity

Description

Hex

Description

07h

Upper non-critical going high

Degraded

The thermal margin has gone over its upper non-critical threshold.

09h

Upper critical going high

non-fatal

Degraded

The thermal margin has gone over its upper critical threshold.

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families

Table 42: Processor Thermal Control Sensors Event Triggers – Description

5.2.3.1 Processor Thermal Control % Sensors – Next Steps

These events normally occur due to failures of the thermal solution:

1. Verify heatsink is properly attached and has thermal grease.

2. If the system has a heatsink fan, ensure the fan is spinning.

3. Check all system fans are operating properly.

4. Check that the air used to cool the system is within limits (typically 35°C).
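The field layout in Tables 41 and 42 can be turned into a small decoder. The following is a minimal sketch, not BMC or vendor code: the function name `decode_thermal_control_event` and the dictionary names are hypothetical, and the reading and threshold are returned as the raw byte values from the SEL record (converting them to a percentage would need the sensor's SDR, which is outside this sketch).

```python
# Hypothetical helper: decode the event bytes of a SEL record for the
# Processor Thermal Control % sensors, using only the values in Tables 41 and 42.

SENSOR_NAMES = {
    0x78: "Processor 1 Thermal Control %",
    0x79: "Processor 2 Thermal Control %",
    0x7A: "Processor 3 Thermal Control %",
    0x7B: "Processor 4 Thermal Control %",
}

EVENT_TRIGGERS = {
    0x07: "Upper non-critical going high",
    0x09: "Upper critical going high",
}


def decode_thermal_control_event(sensor_number, event_dir_type, ed1, ed2, ed3):
    """Return a dict describing a Processor Thermal Control % threshold event."""
    if sensor_number not in SENSOR_NAMES:
        raise ValueError(f"not a thermal control sensor: {sensor_number:#04x}")
    event_type = event_dir_type & 0x7F            # bits [6:0]
    if event_type != 0x01:
        raise ValueError(f"expected event type 01h (Threshold), got {event_type:02X}h")
    deassertion = bool(event_dir_type & 0x80)     # bit [7]
    result = {
        "sensor": SENSOR_NAMES[sensor_number],
        "direction": "Deassertion" if deassertion else "Assertion",
        "trigger": EVENT_TRIGGERS.get(ed1 & 0x0F, "Unknown trigger"),
        "reading": None,
        "threshold": None,
    }
    if (ed1 >> 6) & 0x03 == 0x01:   # [7:6] = 01b: trigger reading in Event Data 2
        result["reading"] = ed2
    if (ed1 >> 4) & 0x03 == 0x01:   # [5:4] = 01b: trigger threshold in Event Data 3
        result["threshold"] = ed3
    return result


# Example bytes (made up for illustration): Processor 1, assertion, trigger 07h.
print(decode_thermal_control_event(0x78, 0x01, 0x57, 0x1E, 0x1A))
```

The decoder rejects records whose event type is not 01h rather than guessing, because the same sensor numbers could in principle appear with other event types on other platforms.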

5.2.4 Processor DTS Thermal Margin Sensors

Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families incorporate a DTS-based thermal specification. This allows much more accurate control of the thermal solution and enables lower fan speeds and lower fan power consumption. For Intel® Xeon® processor E5-4600/2600/2400/1600 product families, this requires significant BMC FW calculations to derive the sensor value. Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families are the follow-on processors to Intel® Xeon® processor E5-4600/2600/2400/1600 product families. For Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families, the BMC’s derivation of this value is greatly simplified because the majority of the calculations are performed within the processor itself.

The main usage of this sensor is as an input to the BMC’s fan control algorithms. The BMC implements this as a threshold sensor. There is one DTS sensor for each installed physical processor package. Thresholds are not set and alert generation is not enabled for these sensors.

Table 43: Processor DTS Thermal Margin Sensors Typical Characteristics

Field | Description
Sensor Type | 01h = Temperature
Sensor Number | 83h = Processor 1 DTS Thermal Margin; 84h = Processor 2 DTS Thermal Margin; 85h = Processor 3 DTS Thermal Margin; 86h = Processor 4 DTS Thermal Margin
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 01h (Threshold)

5.2.5 Discrete Thermal Sensors

Discrete thermal sensors do not report a temperature at all; instead, they report an overheating event of some kind. Examples are VRD Hot (a voltage regulator is overheating) and processor Thermal Trip (the processor got so hot that its over-temperature protection was triggered and the system was shut down to prevent damage).

Table 44: Discrete Thermal Sensors Typical Characteristics

Field | Description
Sensor Type | 01h = Temperature
Sensor Number | See Table 45
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = See Table 45
Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 45
Event Data 2 | Not used
Event Data 3 | Not used

Table 45: Discrete Thermal Sensors – Next Steps

Sensor Number | Sensor Name | Event Type | Event Trigger Offset (Hex) | Event Trigger Offset (Description) | Description
0Dh | SSB Thermal Trip | 03h | 01h | State Asserted | South Side Bridge (SSB) overheated
90h | P1 VRD Hot | 05h | 01h | Limit Exceeded | Processor 1 voltage regulator overheated
91h | P2 VRD Hot | 05h | 01h | Limit Exceeded | Processor 2 voltage regulator overheated
92h | P3 VRD Hot | 05h | 01h | Limit Exceeded | Processor 3 voltage regulator overheated
93h | P4 VRD Hot | 05h | 01h | Limit Exceeded | Processor 4 voltage regulator overheated
94h | P1 Mem01 VRD Hot | 05h | 01h | Limit Exceeded | Processor 1 Memory 0/1 voltage regulator overheated
95h | P1 Mem23 VRD Hot | 05h | 01h | Limit Exceeded | Processor 1 Memory 2/3 voltage regulator overheated
96h | P2 Mem01 VRD Hot | 05h | 01h | Limit Exceeded | Processor 2 Memory 0/1 voltage regulator overheated
97h | P2 Mem23 VRD Hot | 05h | 01h | Limit Exceeded | Processor 2 Memory 2/3 voltage regulator overheated
98h | P3 Mem01 VRD Hot | 05h | 01h | Limit Exceeded | Processor 3 Memory 0/1 voltage regulator overheated
99h | P3 Mem23 VRD Hot | 05h | 01h | Limit Exceeded | Processor 3 Memory 2/3 voltage regulator overheated
9Ah | P4 Mem01 VRD Hot | 05h | 01h | Limit Exceeded | Processor 4 Memory 0/1 voltage regulator overheated
9Bh | P4 Mem23 VRD Hot | 05h | 01h | Limit Exceeded | Processor 4 Memory 2/3 voltage regulator overheated

Next Steps (for the sensors listed in Table 45):

1. Check for clear and unobstructed airflow into and out of the chassis.

2. Ensure the SDR is programmed and correct chassis has been selected.

3. Ensure there are no fan failures.

4. Ensure the air used for cooling the system is within the thermal specifications for the system (typically below 35°C).

5.2.6 DIMM Thermal Trip Sensors

The BMC supports DIMM Thermal Trip monitoring that is instantiated as one aggregate IPMI discrete sensor per CPU. When a DIMM Thermal Trip occurs, the system hardware will automatically power down the server and the BMC will assert the sensor offset and log an event.

Table 46: DIMM Thermal Trip Typical Characteristics

Field | Description
Sensor Type | 0Ch = Memory
Sensor Number | C0h = Processor 1 DIMM Thermal Trip; C1h = Processor 2 DIMM Thermal Trip; C2h = Processor 3 DIMM Thermal Trip; C3h = Processor 4 DIMM Thermal Trip
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 0Ah = Critical over temperature
Event Data 2 | Not used
Event Data 3 | Not used

5.2.6.1 DIMM Thermal Trip Sensors – Next Steps

1. Check for clear and unobstructed airflow into and out of the chassis.

2. Ensure the SDR is programmed and correct chassis has been selected.

3. Ensure there are no fan failures.

4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).

5.3 System Air Flow Monitoring Sensor

The BMC provides an IPMI sensor to report the volumetric system airflow in CFM (cubic feet per minute). The airflow in CFM is calculated based on the system fan PWM (Pulse Width Modulation) values. The specific PWM signal or signals used to determine the CFM are SDR-configurable. The relationship between PWM and CFM is based on a lookup table in an OEM SDR.

The airflow data is used in the calculation for exit air temperature monitoring. It is exposed as an IPMI sensor to allow a data center management application to access this data for use in rack-level thermal management.

This sensor is informational only and will not log events into the SEL.
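To illustrate how a PWM-to-CFM lookup table could produce an airflow figure, here is a minimal sketch. Everything specific in it is an assumption: the table values are placeholders (the real table lives in an OEM SDR and is chassis-specific), the name `airflow_cfm` is hypothetical, and linear interpolation between entries is one plausible choice, not a documented BMC behavior.

```python
from bisect import bisect_right

# Hypothetical PWM duty cycle (%) -> airflow (CFM) pairs. Real values come
# from an OEM SDR and differ per chassis; these numbers are placeholders.
PWM_TO_CFM = [(0, 0.0), (25, 20.0), (50, 45.0), (75, 70.0), (100, 90.0)]


def airflow_cfm(pwm_percent, table=PWM_TO_CFM):
    """Estimate airflow by linear interpolation between lookup-table entries."""
    if not 0 <= pwm_percent <= 100:
        raise ValueError("PWM duty cycle must be between 0 and 100")
    pwms = [p for p, _ in table]
    i = bisect_right(pwms, pwm_percent)
    if i == len(table):                 # exactly at the last table entry
        return table[-1][1]
    (p0, c0), (p1, c1) = table[i - 1], table[i]
    return c0 + (c1 - c0) * (pwm_percent - p0) / (p1 - p0)


print(airflow_cfm(60))
```

A rack-level management application would normally read the finished CFM value from the IPMI sensor rather than recompute it; the sketch only shows the shape of the calculation.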

6. Processor Subsystem

Intel® servers report multiple processor-centric sensors in the SEL.

6.1 Processor Status Sensor

The BMC provides an IPMI sensor of type processor for monitoring status information for each processor slot. If an event state (sensor offset) has been asserted, it remains asserted until one of the following happens:

• A rearm Sensor Events command is executed for the processor status sensor.
• AC or DC power cycle, system reset, or system boot occurs.

CPU Presence status is not saved across AC power cycles and therefore will not generate a deassertion after cycling AC power.

Table 47: Processor Status Sensors Typical Characteristics

Field | Description
Sensor Type | 07h = Processor
Sensor Number | 70h = Processor 1 Status; 71h = Processor 2 Status; 72h = Processor 3 Status; 73h = Processor 4 Status
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset as described in Table 48
Event Data 2 | Not used
Event Data 3 | Not used

Table 48: Processor Status Sensors – Next Steps

Event Trigger Offset: Processor Status | Next Steps
Internal error (IERR) | 1. Cross test the processors. 2. Replace the processors depending on the results of the test.
Thermal trip | This event normally only happens due to failures of the thermal solution: 1. Verify heatsink is properly attached and has thermal grease. 2. If the system has a heatsink fan, ensure the fan is spinning. 3. Check all system fans are operating properly. 4. Check that the air used to cool the system is within limits (typically 35°C).
FRB1/BIST failure; FRB2/Hang in POST failure; FRB3/Processor startup/initialization failure (CPU fails to start); Configuration error (for DMI); SM BIOS uncorrectable CPU-complex error | 1. Cross test the processors. 2. Replace the processors depending on the results of the test.
Processor presence detected | Informational Event
Processor disabled; Terminator presence detected | 1. Cross test the processors. 2. Replace the processors depending on the results of the test.
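A Processor Status event can be mapped to a readable description with a small lookup. This is a sketch under one explicit assumption: the numeric offsets below are the generic sensor-specific offsets that the IPMI specification defines for sensor type 07h (Processor), and the code assumes this platform follows them. The function name `describe_processor_status` is hypothetical.

```python
# Sensor-specific offsets for IPMI sensor type 07h (Processor).
# ASSUMPTION: numeric values follow the generic IPMI specification table.
PROCESSOR_STATUS_OFFSETS = {
    0x0: "Internal error (IERR)",
    0x1: "Thermal trip",
    0x2: "FRB1/BIST failure",
    0x3: "FRB2/Hang in POST failure",
    0x4: "FRB3/Processor startup/initialization failure",
    0x5: "Configuration error (for DMI)",
    0x6: "SM BIOS uncorrectable CPU-complex error",
    0x7: "Processor presence detected",
    0x8: "Processor disabled",
    0x9: "Terminator presence detected",
}


def describe_processor_status(sensor_number, event_dir_type, ed1):
    """Describe a Processor Status event (sensor numbers 70h-73h, event type 6Fh)."""
    if not 0x70 <= sensor_number <= 0x73:
        raise ValueError(f"not a Processor Status sensor: {sensor_number:02X}h")
    if event_dir_type & 0x7F != 0x6F:
        raise ValueError("expected event type 6Fh (Sensor Specific)")
    processor = sensor_number - 0x70 + 1
    direction = "Deassertion" if event_dir_type & 0x80 else "Assertion"
    status = PROCESSOR_STATUS_OFFSETS.get(ed1 & 0x0F, "Unknown offset")
    return f"Processor {processor} Status: {status} ({direction})"


print(describe_processor_status(0x71, 0x6F, 0x01))
```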

6.2 Catastrophic Error Sensor

When the Catastrophic Error signal (CATERR#) stays asserted, it is a sign that something serious has gone wrong in the hardware. The BMC monitors this signal and reports when it stays asserted.

Table 49: Catastrophic Error Sensor Typical Characteristics

Field | Description
Sensor Type | 07h = Processor
Sensor Number | 80h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (Digital Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset = 1h (State Asserted)
Event Data 2 | Event Data 2 values as described in Table 50.
Event Data 3 | Bitmap of the CPU that causes the system CATERR: [0]: CPU1; [1]: CPU2; [2]: CPU3; [3]: CPU4. Note: If more than one bit is set, the BMC cannot determine the source of the CATERR.

Table 50: Catastrophic Error Sensor – Event Data 2 Values – Next Steps

ED2 | Description | Next Steps
00h | Unknown | 1. Cross test the processors. 2. Replace the processors depending on the results of the test.
01h | CATERR | This error is typically caused by other platform components. 1. Check for other errors near the time of the CATERR event. 2. Verify all peripherals are plugged in and operating correctly, particularly Hard Drives, Optical Drives, and I/O. 3. Update system firmware and drivers.
 | CPU Core Error | 1. Cross test the processors. 2. Replace the processors depending on the results of the test.
 | MSID Mismatch | Verify the processor is supported by your baseboard. Check your board's Technical Product Specification (TPS).
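The Event Data 3 bitmap of a Catastrophic Error event (bit 0 = CPU1 through bit 3 = CPU4) can be unpacked as follows. This is a minimal sketch; the function names `caterr_sources` and `describe_caterr_source` are hypothetical, and the wording of the returned messages is illustrative only.

```python
def caterr_sources(ed3):
    """Return the CPUs flagged in Event Data 3 of a Catastrophic Error event."""
    return [f"CPU{bit + 1}" for bit in range(4) if ed3 & (1 << bit)]


def describe_caterr_source(ed3):
    """Summarize the bitmap, honoring the rule that several set bits are ambiguous."""
    cpus = caterr_sources(ed3)
    if len(cpus) == 1:
        return f"CATERR attributed to {cpus[0]}"
    if not cpus:
        return "No CPU bit set in Event Data 3"
    # More than one bit set: the BMC could not determine the source.
    return "Source undetermined; bits set for " + ", ".join(cpus)


print(describe_caterr_source(0x02))
print(describe_caterr_source(0x05))
```

Treating a multi-bit bitmap as "undetermined" rather than picking the lowest bit mirrors the note in Table 49.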

6.3 CPU Missing Sensor

The CPU Missing sensor is a discrete sensor that reports that a processor is not installed. The most common cause of this event is a processor populated in the incorrect socket.

Table 51: CPU Missing Sensor Typical Characteristics

Field | Description
Sensor Type | 07h = Processor
Sensor Number | 82h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 6Fh (Sensor Specific)
Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h (State Asserted)
Event Data 2 | Not used
Event Data 3 | Not used

6.3.1 CPU Missing Sensor – Next Steps

Verify the processor is installed in the correct slot.

6.4 Quick Path Interconnect Sensors

The Intel® Quick Path Interconnect (QPI) bus on Intel® EPSD Boards Based on Intel® Xeon® Processor E5-4600/2600/2400/1600/1400 Product Families is the interconnect between processors.

The QPI Link Width Reduced sensor is used by the BIOS POST to report when the link width has been reduced. Therefore the Generator ID will be 0001h.

The QPI Error sensors are reported by the BIOS SMI Handler to the BMC, so the Generator ID will be 0033h.

6.4.1 QPI Link Width Reduced Sensor

BIOS POST has reduced the QPI Link Width because of an error condition seen during initialization.

Table 52: QPI Link Width Reduced Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0001h = BIOS POST
Sensor Type (byte 11) | 13h = Critical Interrupt
Sensor Number | 09h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 77h (OEM Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 1h = Reduced to ½ width; 2h = Reduced to ¼ width
Event Data 2 | 0-3 = CPU1-4
Event Data 3 | Not used

6.4.1.1 QPI Link Width Reduced Sensor – Next Steps

If the error continues:

1. Check the processor is installed correctly.

2. Inspect the socket for bent pins.

3. Cross test the processor. If the issue remains with the processor socket, replace the main board, otherwise the processor.

6.4.2 QPI Correctable Error Sensor

The system detected an error and corrected it. This is an informational event.

Table 53: QPI Correctable Error Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0033h = BIOS SMI Handler
Sensor Type (byte 11) | 13h = Critical Interrupt
Sensor Number | 06h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 72h (OEM Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = Reserved
Event Data 2 | 0-3 = CPU1-4
Event Data 3 | Not used

6.4.2.1 QPI Correctable Error Sensor – Next Steps

This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Check the processor is installed correctly.

2. Inspect the socket for bent pins.

3. Cross test the processor. If the issue remains with the processor socket, replace the main board, otherwise the processor.

6.4.3 QPI Fatal Error and Fatal Error #2

The system detected a QPI fatal or non-recoverable error. This is a fatal error.

Table 54: QPI Fatal Error Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0033h = BIOS SMI Handler
Sensor Type (byte 11) | 13h = Critical Interrupt
Sensor Number | 07h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 73h (OEM Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Link Layer Uncorrectable ECC Error; 1h = Protocol Layer Poisoned Packet Reception Error; 2h = Link/PHY Init Failure with resultant degradation in link width; 3h = PHY Layer detected drift buffer alarm; 4h = PHY detected latency buffer rollover; 5h = PHY Init Failure; 6h = Link Layer generic control error (buffer overflow/underflow, credit underflow and so on); 7h = Parity error in link or PHY layer; 8h = Protocol layer timeout detected; 9h = Protocol layer failed response; Ah = Protocol layer illegal packet field, target Node ID Error, and so on; Bh = Protocol Layer Queue/table overflow/underflow; Ch = Viral Error; Dh = Protocol Layer parity error; Eh = Routing Table Error; Fh = Reserved (unused)
Event Data 2 | 0-3 = CPU1-4
Event Data 3 | Not used

The QPI Fatal Error #2 is a continuation of QPI Fatal Error.

Table 55: QPI Fatal #2 Error Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0033h = BIOS SMI Handler
Sensor Type (byte 11) | 13h = Critical Interrupt
Sensor Number | 17h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 74h (OEM Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset: 0h = Illegal inbound request; 1h = IIO Write Cache Uncorrectable Data ECC Error; 2h = IIO CSR crossing 32-bit boundary Error; 3h = IIO Received XPF physical/logical redirect interrupt inbound; 4h = IIO Illegal SAD or Illegal or non-existent address or memory; 5h = IIO Write Cache Coherency Violation
Event Data 2 | 0-3 = CPU1-4
Event Data 3 | Not used

6.4.3.1 QPI Fatal Error and Fatal Error #2 – Next Steps

This is a fatal error, as described in section 6.4.3. If the error continues:

1. Check the processor is installed correctly.

2. Inspect the socket for bent pins.

3. Cross test the processor. If the issue remains with the processor socket, replace the main board, otherwise the processor.
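The four QPI sensors in section 6.4 differ only in sensor number, Generator ID, and the meaning of the Event Data 1 offset, so they can be decoded with one table-driven function. The sketch below uses only values from Tables 52 through 55; the names `QPI_SENSORS` and `decode_qpi_event` are hypothetical, and the Generator ID is assumed to be passed in as a 16-bit integer already assembled from SEL record bytes 8 and 9.

```python
# Sensor number -> (name, expected Generator ID, Event Data 1 [3:0] offsets).
QPI_SENSORS = {
    0x09: ("QPI Link Width Reduced", 0x0001, {
        0x1: "Reduced to 1/2 width",
        0x2: "Reduced to 1/4 width",
    }),
    0x06: ("QPI Correctable Error", 0x0033, {}),  # offset field is reserved
    0x07: ("QPI Fatal Error", 0x0033, {
        0x0: "Link Layer Uncorrectable ECC Error",
        0x1: "Protocol Layer Poisoned Packet Reception Error",
        0x2: "Link/PHY Init Failure with resultant degradation in link width",
        0x3: "PHY Layer detected drift buffer alarm",
        0x4: "PHY detected latency buffer rollover",
        0x5: "PHY Init Failure",
        0x6: "Link Layer generic control error",
        0x7: "Parity error in link or PHY layer",
        0x8: "Protocol layer timeout detected",
        0x9: "Protocol layer failed response",
        0xA: "Protocol layer illegal packet field, target Node ID Error, and so on",
        0xB: "Protocol Layer Queue/table overflow/underflow",
        0xC: "Viral Error",
        0xD: "Protocol Layer parity error",
        0xE: "Routing Table Error",
    }),
    0x17: ("QPI Fatal Error #2", 0x0033, {
        0x0: "Illegal inbound request",
        0x1: "IIO Write Cache Uncorrectable Data ECC Error",
        0x2: "IIO CSR crossing 32-bit boundary Error",
        0x3: "IIO Received XPF physical/logical redirect interrupt inbound",
        0x4: "IIO Illegal SAD or Illegal or non-existent address or memory",
        0x5: "IIO Write Cache Coherency Violation",
    }),
}

SENSOR_TYPE_CRITICAL_INTERRUPT = 0x13


def decode_qpi_event(generator_id, sensor_type, sensor_number, ed1, ed2):
    """Decode a QPI SEL event into sensor name, offset detail, and CPU."""
    if sensor_type != SENSOR_TYPE_CRITICAL_INTERRUPT or sensor_number not in QPI_SENSORS:
        raise ValueError("not a QPI sensor event")
    name, expected_generator, offsets = QPI_SENSORS[sensor_number]
    if generator_id != expected_generator:
        raise ValueError(f"unexpected Generator ID {generator_id:04X}h for {name}")
    if not 0 <= ed2 <= 3:
        raise ValueError("Event Data 2 must be 0-3 (CPU1-4)")
    return {
        "sensor": name,
        "detail": offsets.get(ed1 & 0x0F, "Reserved"),
        "cpu": f"CPU{ed2 + 1}",
    }


print(decode_qpi_event(0x0033, 0x13, 0x07, 0x80, 0x01))
```

Checking the Generator ID against the expected source is optional, but it helps catch records from a different sensor that happens to reuse the same sensor number.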

6.5 Processor ERR2 Timeout Sensor

The BMC supports an ERR2 Timeout Sensor (one per CPU) that asserts if a CPU’s ERR2 signal has been asserted for longer than a fixed time period (> 90 seconds). ERR[2] is a processor signal that indicates when the IIO (Integrated I/O module in the processor) has a fatal error that could not be communicated to the core to trigger an SMI. ERR[2] events are fatal error conditions that the BIOS and OS will attempt to handle gracefully, but they may not always do so reliably. A continuously asserted ERR2 signal is an indication that the BIOS cannot service the condition that caused the error. This is usually because that condition prevents the BIOS from running.

When an ERR2 timeout occurs, the BMC asserts/deasserts the ERR2 Timeout Sensor, and logs a SEL event for that sensor. The default behavior for BMC core firmware is to initiate a system reset upon detection of an ERR2 timeout. The BIOS setup utility provides an option to disable or enable system reset by the BMC on detection of this condition.
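The timeout rule can be modeled with a small state machine. This is a toy model for reasoning about the behavior, not the BMC's implementation: the class name `Err2TimeoutMonitor`, the sampling interface, and the choice to deassert the sensor as soon as the signal drops are all assumptions. Only the 90-second limit comes from the text.

```python
ERR2_TIMEOUT_SECONDS = 90  # from the text: asserted for longer than 90 seconds


class Err2TimeoutMonitor:
    """Toy model of one per-CPU ERR2 Timeout Sensor (not the BMC's actual code)."""

    def __init__(self, timeout=ERR2_TIMEOUT_SECONDS):
        self.timeout = timeout
        self.asserted_since = None    # time at which ERR2 was first seen asserted
        self.sensor_asserted = False  # current state of the timeout sensor

    def sample(self, err2_asserted, now):
        """Feed one sample of the ERR2 signal; return an event name or None."""
        if not err2_asserted:
            self.asserted_since = None
            if self.sensor_asserted:
                self.sensor_asserted = False
                return "ERR2 Timeout deasserted"
            return None
        if self.asserted_since is None:
            self.asserted_since = now
        if not self.sensor_asserted and now - self.asserted_since > self.timeout:
            self.sensor_asserted = True
            return "ERR2 Timeout asserted"
        return None


monitor = Err2TimeoutMonitor()
for t, level in [(0, True), (60, True), (91, True), (120, False)]:
    print(t, monitor.sample(level, t))
```

Passing the timestamp in explicitly keeps the model deterministic and easy to test; a real monitor would read a clock.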

Table 56: Processor ERR2 Timeout Sensor Typical Characteristics

Field | Description
Sensor Type | 07h = Processor
Sensor Number | 7Ch = Processor 1 ERR2 Timeout; 7Dh = Processor 2 ERR2 Timeout; 7Eh = Processor 3 ERR2 Timeout; 7Fh = Processor 4 ERR2 Timeout
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (“digital” discrete)
Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h (State Asserted)
Event Data 2 | Not used
Event Data 3 | Not used

6.5.1 Processor ERR2 Timeout – Next Steps

1. Check the SEL for any other events around the time of the failure.

2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel® Syscfg utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure analysis.

6.6 Processor MSID Mismatch Sensor

The BMC supports an MSID Mismatch sensor that monitors for the fault condition that occurs if there is a power rating incompatibility between a baseboard and a processor.

The sensor is rearmed on power-on (AC or DC power on transitions).

Table 57: Processor MSID Mismatch Sensor Typical Characteristics

Field | Description
Sensor Type | 07h = Processor
Sensor Number | 81h = Processor 1 MSID Mismatch; 87h = Processor 2 MSID Mismatch; 88h = Processor 3 MSID Mismatch; 89h = Processor 4 MSID Mismatch
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 03h (“digital” discrete)
Event Data 1 | [7:6] – 00b = Unspecified Event Data 2; [5:4] – 00b = Unspecified Event Data 3; [3:0] – Event Trigger Offset = 1h (State Asserted)
Event Data 2 | Not used
Event Data 3 | Not used

6.6.1 Processor MSID Mismatch Sensor – Next Steps

Verify the processor is supported by your baseboard. Check your board's Technical Product Specification (TPS).

7. Memory Subsystem

Intel® servers report memory errors, status, and configuration in the SEL.

7.1 Memory RAS Configuration Status

A Memory RAS Configuration Status event is logged after an AC power-on occurs, only if any RAS Mode is currently configured, and only if RAS Mode is successfully initiated.

This is to make sure that there is a record in the SEL telling what the RAS Mode was at the time that the system started up. This is only logged after AC power-on, not DC power-on.

The Memory RAS Configuration Status Sensor is also used to log an event during POST whenever there is a RAS configuration error. This is a case where a RAS Mode has been selected but when the system boots, the memory configuration cannot support the RAS Mode. The RAS Mode configuration fails, and the memory operates in Independent Channel Mode.

In the SEL record logged, the ED1 Offset value is “RAS Configuration Disabled”, and ED3 contains the RAS Mode that is currently selected but could not be configured. ED2 gives the reason for the RAS configuration failure. At present, only two “RAS Configuration Error Type” values are implemented:

0 = None – This is used for an AC power-on log record when the RAS configuration is successfully configured.

3 = Invalid DIMM Configuration for RAS Mode – The installed DIMM configuration cannot support the currently selected RAS Mode. This may be due to DIMMs that have failed or been disabled, so when this reason has been logged, the user should check the preceding SEL events to see whether there are DIMM error events.

Table 58: Memory RAS Configuration Status Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0001h = BIOS POST
Sensor Type (byte 11) | 0Ch = Memory
Sensor Number | 02h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 09h (digital Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset as described in Table 59
Event Data 2 | RAS Configuration Error Type: [7:4] = Reserved; [3:0] = Configuration Error: 0 = None; 3 = Invalid DIMM Configuration for RAS Mode; all other values are reserved.
Event Data 3 | RAS Mode Configured: [7:4] = Reserved; [3:0] = RAS Mode: 0h = None (Independent Channel Mode); 1h = Mirroring Mode; 2h = Lockstep Mode; 4h = Rank Sparing Mode

Table 59: Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps

Event Trigger Offset (Hex) | Event Trigger Offset (Description) | Description | Next Steps
01h | RAS configuration enabled. | User enabled mirrored channel mode in setup. | Informational event only.
00h | RAS configuration disabled. | Mirrored channel mode is disabled (either in setup or due to unavailability of memory at POST, in which case POST error 8500 is also logged). | 1. If this event is accompanied by a POST error 8500, there was a problem applying the mirroring configuration to the memory. Check for other errors related to the memory and troubleshoot accordingly. 2. If there is no POST error, mirror mode was simply disabled in BIOS setup and this should be considered informational only.
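The three event data bytes of a Memory RAS Configuration Status event can be decoded with plain bit masks. The following is a minimal sketch using only the values from Tables 58 and 59; the function name `decode_ras_config_status` and the dictionary keys are hypothetical.

```python
RAS_CONFIG_OFFSETS = {
    0x0: "RAS configuration disabled",
    0x1: "RAS configuration enabled",
}

RAS_CONFIG_ERRORS = {
    0x0: "None",
    0x3: "Invalid DIMM Configuration for RAS Mode",
}

RAS_MODES = {
    0x0: "None (Independent Channel Mode)",
    0x1: "Mirroring Mode",
    0x2: "Lockstep Mode",
    0x4: "Rank Sparing Mode",
}


def decode_ras_config_status(ed1, ed2, ed3):
    """Decode Event Data 1-3 of a Memory RAS Configuration Status event."""
    return {
        "state": RAS_CONFIG_OFFSETS.get(ed1 & 0x0F, "Reserved"),   # ED1 [3:0]
        "error": RAS_CONFIG_ERRORS.get(ed2 & 0x0F, "Reserved"),    # ED2 [3:0]
        "ras_mode": RAS_MODES.get(ed3 & 0x0F, "Reserved"),         # ED3 [3:0]
    }


# Example: Mirroring Mode was selected but could not be configured.
print(decode_ras_config_status(0xA0, 0x03, 0x01))
```

In the example, the combination of "disabled" with error type 3 is the case where the preceding SEL events are worth checking for DIMM errors.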

7.2 Memory RAS Mode Select

Memory RAS Mode Select events are logged to record changes in RAS Mode. When a RAS Mode selection is made that changes the RAS Mode (including selecting a RAS Mode from or to Independent Channel Mode), that change is logged to SEL in a Memory RAS Mode Select event message, which records the previous RAS Mode (from) and the newly selected RAS Mode (to). The event also includes an Offset value in ED1 which indicates whether the mode change left the system with a RAS Mode active (Enabled), or not (Disabled – Independent Channel Mode selected). This sensor provides the Spare Channel mode RAS Configuration status. Memory RAS Mode Select is an informational event.

Table 60: Memory RAS Mode Select Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0001h = BIOS POST
Sensor Type (byte 11) | 0Ch = Memory
Sensor Number | 12h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 09h (digital Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset: 0h = RAS Configuration Disabled; 1h = RAS Configuration Enabled
Event Data 2 | Prior RAS Mode: [7:4] = Reserved; [3:0] = RAS Mode: 0h = None (Independent Channel Mode); 1h = Mirroring Mode; 2h = Lockstep Mode; 4h = Rank Sparing Mode
Event Data 3 | Selected RAS Mode: [7:4] = Reserved; [3:0] = RAS Mode: 0h = None (Independent Channel Mode); 1h = Mirroring Mode; 2h = Lockstep Mode; 4h = Rank Sparing Mode

7.3 Mirroring Redundancy State

Mirroring Mode protects memory data by full redundancy – keeping complete copies of all data on both channels of a Mirroring Domain (channel pair). If an Uncorrectable Error, which is normally fatal, occurs on one channel of a pair, and the other channel is still intact and operational, then the Uncorrectable Error is “demoted” to a Correctable Error, and the failed channel is disabled. Because the Mirror Domain is no longer redundant, a Mirroring Redundancy State SEL Event is logged.

Table 61: Mirroring Redundancy State Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0033h = BIOS SMI Handler
Sensor Type (byte 11) | 0Ch = Memory
Sensor Number | 01h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Bh (Generic Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset: 0h = Fully Redundant; 2h = Redundancy Degraded
Event Data 2 | Location: [7:4] = Mirroring Domain (0-1 = Channel Pair for Socket); [3:2] = Reserved; [1:0] = Rank on DIMM (0-3 = Rank Number)
Event Data 3 | Location: [7:5] = Socket ID (0-3 = CPU1-4); [4:3] = Channel (0-3 = Channel A-D for Socket); [2:0] = DIMM (0-2 = DIMM 1-3 on Channel)

7.3.1 Mirroring Redundancy State Sensor – Next Steps

This event is accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected DIMM).

For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error triggering the Mirroring Failover action, that is, the failing DIMM.
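On boards without DIMM Fault LEDs, the failing DIMM can still be located from the Event Data 2 and Event Data 3 bytes of the Mirroring Redundancy State event. The sketch below unpacks the bit fields listed in Table 61; the function name `decode_mirroring_location` and the returned key names are hypothetical.

```python
def decode_mirroring_location(ed2, ed3):
    """Unpack the location fields of a Mirroring Redundancy State event."""
    domain = (ed2 >> 4) & 0x0F    # ED2 [7:4]: Mirroring Domain (channel pair)
    rank = ed2 & 0x03             # ED2 [1:0]: rank on DIMM
    socket = (ed3 >> 5) & 0x07    # ED3 [7:5]: 0-3 = CPU1-4
    channel = (ed3 >> 3) & 0x03   # ED3 [4:3]: 0-3 = Channel A-D
    dimm = ed3 & 0x07             # ED3 [2:0]: 0-2 = DIMM 1-3
    if socket > 3 or dimm > 2 or domain > 1:
        raise ValueError("field value outside the range documented in Table 61")
    return {
        "cpu": f"CPU{socket + 1}",
        "mirroring_domain": domain,
        "channel": "ABCD"[channel],
        "dimm": dimm + 1,
        "rank": rank,
    }


# Example bytes (made up): domain 1, rank 2, CPU2, channel C, DIMM 2.
print(decode_mirroring_location(0x12, 0x31))
```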

7.4 Sparing Redundancy State

Rank Sparing Mode is a Memory RAS configuration option that reserves one memory rank per channel as a “spare rank”. If any rank on a given channel experiences enough Correctable ECC Errors to cross the Correctable Error Threshold, the data in that rank is copied to the spare rank, and then the spare rank is mapped into the memory array to replace the failing rank.

Rank Sparing Mode protects memory data by reserving a “Spare Rank” on each channel that has memory installed on it. If a Correctable Error Threshold event occurs, the data from the failing rank is copied to the Spare Rank on the same channel, and the failing DIMM is disabled. Because the Sparing Domain is no longer redundant, a Sparing Redundancy State SEL Event is logged.

Table 62: Sparing Redundancy State Sensor Typical Characteristics

Field | Description
Generator ID (bytes 8-9) | 0033h = BIOS SMI Handler
Sensor Type (byte 11) | 0Ch = Memory
Sensor Number | 11h
Event Direction and Event Type | [7] Event direction: 0b = Assertion Event, 1b = Deassertion Event; [6:0] Event Type = 0Bh (Generic Discrete)
Event Data 1 | [7:6] – 10b = OEM code in Event Data 2; [5:4] – 10b = OEM code in Event Data 3; [3:0] – Event Trigger Offset: 0h = Fully Redundant; 2h = Redundancy Degraded
Event Data 2 | Location: [7:4] = Sparing Domain (0-3 = Channel A-D for Socket); [3:2] = Reserved; [1:0] = Rank on DIMM (0-3 = Rank Number)
Event Data 3 | Location: [7:5] = Socket ID (0-3 = CPU1-4); [4:3] = Channel (0-3 = Channel A-D for Socket); [2:0] = DIMM (0-2 = DIMM 1-3 on Channel)

Byte

Field

Description

8 9 Generator ID

0033h = BIOS SMI Handler 11

Sensor Type

0ch = Memory

Sensor Number

02h

Event Direction and Event Type

[7] Event direction

0b = Assertion Event 1b = Deassertion Event

[6:0] Event Type = 6Fh (Sensor Specific)

Event Data 1

[7:6] – 10b = OEM code in Event Data 2

System Event Log Troubleshooting Guide for EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600/1400 Product Families

7.4.1 Sparing Redundancy State Sensor – Next Steps

This event is accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected DIMM).

For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error that triggered the Rank Sparing action, that is, the failing DIMM.

7.5 ECC and Address Parity

1. Memory data errors are logged as correctable or uncorrectable.

2. Uncorrectable errors are fatal.

3. Memory addresses are protected with parity bits and a parity error is logged. This is a fatal error.

7.5.1 Memory Correctable and Uncorrectable ECC Error

ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors. A logged “Correctable ECC Error” actually represents a threshold overflow: more Correctable Errors than the threshold allows were detected at the memory controller level for a given DIMM within a given timeframe. In both cases, the error can be narrowed down to particular DIMM(s). The BIOS SMI error handler uses this information to log the data to the BMC SEL and identify the failing DIMM.


Table 63: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics


Byte  Field                           Description
14    Event Data 1 (continued)        [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 64
15    Event Data 2                    [7:2] – Reserved. Set to 0.
                                      [1:0] – Rank on DIMM
                                          0-3 = Rank number
16    Event Data 3                    [7:5] – Socket ID
                                          0-3 = CPU1-4
                                      [4:3] – Channel
                                          0-3 = Chan A-D for Socket
                                      [2:0] – DIMM
                                          0-2 = DIMM 1-3 on Channel

Event Trigger Offset 01h: Uncorrectable ECC Error

Description: An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash (unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error) and an MCE (Machine Check Exception Error).

While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact between socket and DIMM, or by bent pins in the processor socket.

Next Steps:

1. If needed, decode DIMM location from hex version of SEL.

2. Verify the DIMM is seated properly.

3. Examine gold fingers on edge of the DIMM to verify contacts are clean.

4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.

5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.

Event Trigger Offset 00h: Correctable ECC Error threshold reached

Description: There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event in itself does not pose any direct problems because the ECC errors are still being corrected. Depending on the RAS configuration of the memory, the IMC may take the affected DIMM offline.

Next Steps: Even though this event doesn't immediately lead to problems, it can indicate that one of the DIMMs is slowly failing. If this error occurs more than once:

1. If needed, decode DIMM location from hex version of SEL.

2. Verify the DIMM is seated properly.

3. Examine gold fingers on edge of the DIMM to verify contacts are clean.

4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.

5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.

Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
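
Step 1 of the procedures above (decoding the DIMM location from the hex version of the SEL) can be sketched in Python as follows. This is a minimal illustration that relies only on the Event Data bit layout given for the ECC Error sensor; the function name decode_ecc_event, the dictionary keys, and the sample byte values are hypothetical and are not part of any Intel utility.

```python
# Event Trigger Offsets for the Memory ECC Error sensor (Table 64).
ECC_OFFSETS = {
    0x0: "Correctable ECC Error threshold reached",
    0x1: "Uncorrectable ECC Error",
}


def decode_ecc_event(ed1: int, ed2: int, ed3: int) -> dict:
    """Decode Event Data 1-3 of a Memory ECC Error event (sensor 02h)."""
    offset = ed1 & 0x0F            # ED1 [3:0] = Event Trigger Offset
    rank = ed2 & 0x03              # ED2 [1:0] = Rank on DIMM
    socket = (ed3 >> 5) & 0x07     # ED3 [7:5]: 0-3 = CPU1-4
    channel = (ed3 >> 3) & 0x03    # ED3 [4:3]: 0-3 = Channel A-D
    dimm = ed3 & 0x07              # ED3 [2:0]: 0-2 = DIMM 1-3
    return {
        "event": ECC_OFFSETS.get(offset, "Unknown offset"),
        "cpu": socket + 1,
        "channel": "ABCD"[channel],
        "dimm": dimm + 1,
        "rank": rank,
    }


# Hypothetical record bytes: ED1 = A1h, ED2 = 01h, ED3 = 29h.
print(decode_ecc_event(0xA1, 0x01, 0x29))
```

With these sample bytes the event decodes to an Uncorrectable ECC Error on CPU2, Channel B, DIMM 2, rank 1.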

Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   13h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 2h
15    Event Data 2                    [7:5] – Reserved. Set to 0.
                                      [4] – Channel Information Validity Check:
                                          0b = Channel Number in Event Data 3 Bits[4:3] is not valid
                                          1b = Channel Number in Event Data 3 Bits[4:3] is valid
                                      [3] – DIMM Information Validity Check:
                                          0b = DIMM Slot ID in Event Data 3 Bits[2:0] is not valid


7.5.2 Memory Address Parity Error

Address Parity errors are errors detected in the memory addressing hardware. Because these affect the addressing of memory contents, they can potentially lead to the same sort of failures as ECC errors. They are logged as a distinct type of error because they affect memory addressing rather than memory contents, but otherwise they are treated exactly the same as Uncorrectable ECC Errors. Address Parity errors are logged to the BMC SEL, with Event Data to identify the failing address by channel and DIMM to the extent that it is possible to do so.

Table 65: Address Parity Error Sensor Typical Characteristics


Byte  Field                           Description
15    Event Data 2 (continued)            1b = DIMM Slot ID in Event Data 3 Bits[2:0] is valid
                                      [2:0] – Error Type:
                                          000b = Parity Error Type not known
                                          001b = Data Parity Error (not used)
                                          010b = Address Parity Error
                                          All other values are reserved.
16    Event Data 3                    [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
                                          0-3 = CPU1-4
                                          All other values are reserved.
                                      [4:3] – Channel Number (if valid) on which the Parity Error occurred. This value will be indeterminate and should be ignored if ED2 Bit [4] is 0b.
                                          00b = Channel A
                                          01b = Channel B
                                          10b = Channel C
                                          11b = Channel D
                                      [2:0] – DIMM Slot ID (if valid) of the specific DIMM that was involved in the transaction that led to the parity error. This value will be indeterminate and should be ignored if ED2 Bit [3] is 0b.
                                          000b = DIMM Socket 1
                                          001b = DIMM Socket 2
                                          010b = DIMM Socket 3
                                          All other values are reserved.
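
Because the channel and DIMM fields of an Address Parity Error are meaningful only when the matching validity bits in Event Data 2 are set, a decoder should check those bits before reporting a location. The following Python sketch shows one way to do that. The function name decode_address_parity and the use of None for an invalid field are illustrative assumptions, not part of any Intel tool.

```python
from typing import Optional

# ED2 [2:0] Error Type values from the Address Parity Error sensor table.
ERROR_TYPES = {
    0b000: "Parity Error Type not known",
    0b001: "Data Parity Error (not used)",
    0b010: "Address Parity Error",
}


def decode_address_parity(ed2: int, ed3: int) -> dict:
    """Decode Event Data 2 and 3 of an Address Parity Error event (sensor 13h)."""
    channel_valid = bool(ed2 & 0x10)   # ED2 bit [4]
    dimm_valid = bool(ed2 & 0x08)      # ED2 bit [3]
    # Report a field only when its validity bit is set; otherwise it is indeterminate.
    channel: Optional[str] = "ABCD"[(ed3 >> 3) & 0x03] if channel_valid else None
    dimm: Optional[int] = (ed3 & 0x07) + 1 if dimm_valid else None
    return {
        "error_type": ERROR_TYPES.get(ed2 & 0x07, "Reserved"),
        "cpu": ((ed3 >> 5) & 0x07) + 1,   # ED3 [7:5]: 0-3 = CPU1-4
        "channel": channel,
        "dimm": dimm,
    }


# Hypothetical bytes: channel valid, DIMM not valid, Address Parity Error.
print(decode_address_parity(0x12, 0x4A))
```

In this example the DIMM field is returned as None because ED2 bit [3] is 0b, so only the CPU and channel should be used when locating the fault.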


7.5.2.1 Memory Address Parity Error Sensor – Next Steps

These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn. An Address Parity Error is logged as such in SEL but in all other ways is treated the same as an Uncorrectable ECC Error.

While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact between the socket and DIMM, or by bent pins in the processor socket.

1. If needed, decode DIMM location from hex version of SEL.

2. Verify the DIMM is seated properly.

3. Examine gold fingers on edge of the DIMM to verify contacts are clean.


4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.

5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.


Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   03h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset


8. PCI Express* and Legacy PCI Subsystem

The PCI Express* (PCIe) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The BIOS logs AER events into the SEL.

The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.

8.1 PCI Express* Errors

PCIe error events are either correctable (informational) or fatal. In both cases, information is logged to help identify the source of the PCIe error: the bus, device, and function are included in the extended data fields. The operating system maps each PCIe device by bus, device, and function, which together uniquely identify it, so the device behind a logged event can be looked up in the operating system.

8.1.1 Legacy PCI Errors

Legacy PCI errors include PERR and SERR; both are fatal errors.


Table 66: Legacy PCI Error Sensor Typical Characteristics


Byte  Field                           Description
14    Event Data 1 (continued)            4h = PCI PERR
                                          5h = PCI SERR
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number
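
Event Data 2 and Event Data 3 together carry the bus, device, and function of the reporting device, and the same layout is used by the PCI Express* error sensors in this chapter. The Python sketch below unpacks those fields; the function name decode_bdf and the output format are illustrative choices. The bus:device.function string is written in hexadecimal, the notation printed by tools such as lspci on Linux*, so the result can be compared against the operating system's device listing.

```python
def decode_bdf(ed2: int, ed3: int) -> str:
    """Return the PCI bus:device.function string for a PCI/PCIe SEL event."""
    bus = ed2 & 0xFF              # ED2 = PCI Bus number
    device = (ed3 >> 3) & 0x1F    # ED3 [7:3] = PCI Device number
    function = ed3 & 0x07         # ED3 [2:0] = PCI Function number
    return f"{bus:02x}:{device:02x}.{function:x}"


# Hypothetical bytes: ED2 = 03h, ED3 = 10h decodes to bus 3, device 2, function 0.
print(decode_bdf(0x03, 0x10))
```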

Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   04h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 70h (OEM Specific)


8.1.1.1 Legacy PCI Error Sensor – Next Steps

1. Decode the bus, device, and function to identify the card.

2. If this is an add-in card:
   a. Verify the card is inserted properly.
   b. Install the card in another slot and check whether the error follows the card or stays with the slot.
   c. Update all firmware and drivers, including non-Intel components.

3. If this is an on-board device:
   a. Update all BIOS, firmware, and drivers.
   b. Replace the board.

8.1.2 PCI Express* Fatal Errors and Fatal Error #2

When a PCI Express* fatal error is reported to the BIOS SMI handler, it will record the error using the following format.


Table 67: PCI Express* Fatal Error Sensor Typical Characteristics


Byte  Field                           Description
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Data Link Layer Protocol Error
                                          1h = Surprise Link Down Error
                                          2h = Completer Abort
                                          3h = Unsupported Request
                                          4h = Poisoned TLP
                                          5h = Flow Control Protocol
                                          6h = Completion Timeout
                                          7h = Receiver Buffer Overflow
                                          8h = ACS Violation
                                          9h = Malformed TLP
                                          Ah = ECRC Error
                                          Bh = Received Fatal Message From Downstream
                                          Ch = Unexpected Completion
                                          Dh = Received ERR_NONFATAL Message
                                          Eh = Uncorrectable Internal
                                          Fh = MC Blocked TLP
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number

Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt


The PCI Express* Fatal Error #2 is a continuation of the PCI Express* Fatal Error.


Table 68: PCI Express* Fatal Error #2 Sensor Typical Characteristics


Byte  Field                           Description
12    Sensor Number                   14h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 76h (OEM Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Atomic Egress Blocked
                                          1h = TLP Prefix Blocked
                                          Fh = Unspecified Non-AER Fatal Error
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number
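
Since the Fatal Error and Fatal Error #2 sensors reuse the same Event Trigger Offset values for different errors, a decoder has to select the offset table by sensor number first. The Python sketch below illustrates this using the sensor numbers (04h and 14h) and offsets listed in Tables 67 and 68; the names PCIE_FATAL, PCIE_FATAL_2, and describe_pcie_fatal are hypothetical.

```python
# Event Trigger Offsets for the PCI Express* Fatal Error sensor (04h).
PCIE_FATAL = {
    0x0: "Data Link Layer Protocol Error",
    0x1: "Surprise Link Down Error",
    0x2: "Completer Abort",
    0x3: "Unsupported Request",
    0x4: "Poisoned TLP",
    0x5: "Flow Control Protocol",
    0x6: "Completion Timeout",
    0x7: "Receiver Buffer Overflow",
    0x8: "ACS Violation",
    0x9: "Malformed TLP",
    0xA: "ECRC Error",
    0xB: "Received Fatal Message From Downstream",
    0xC: "Unexpected Completion",
    0xD: "Received ERR_NONFATAL Message",
    0xE: "Uncorrectable Internal",
    0xF: "MC Blocked TLP",
}

# Event Trigger Offsets for the PCI Express* Fatal Error #2 sensor (14h).
PCIE_FATAL_2 = {
    0x0: "Atomic Egress Blocked",
    0x1: "TLP Prefix Blocked",
    0xF: "Unspecified Non-AER Fatal Error",
}

TABLES_BY_SENSOR = {0x04: PCIE_FATAL, 0x14: PCIE_FATAL_2}


def describe_pcie_fatal(sensor_number: int, ed1: int) -> str:
    """Map a sensor number and Event Data 1 byte to the fatal error name."""
    table = TABLES_BY_SENSOR.get(sensor_number)
    if table is None:
        raise ValueError(f"not a PCIe fatal error sensor: {sensor_number:02X}h")
    return table.get(ed1 & 0x0F, "Reserved")   # ED1 [3:0] = Event Trigger Offset


print(describe_pcie_fatal(0x04, 0xA4))
print(describe_pcie_fatal(0x14, 0xA1))
```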


8.1.2.1 PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps

1. Decode the bus, device, and function to identify the card.

3. If this is an on-board device:
   a. Update all BIOS, firmware, and drivers.
   b. Replace the board.

8.1.3 PCI Express* Correctable Errors

When a PCI Express* correctable error is reported to the BIOS SMI handler, it will record the error using the following format.


Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   05h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 71h (OEM Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Receiver Error
                                          1h = Bad DLLP
                                          2h = Bad TLP
                                          3h = Replay Num Rollover
                                          4h = Replay Timer timeout
                                          5h = Advisory Non-fatal
                                          6h = Link BW Changed
                                          7h = Correctable Internal
                                          8h = Header Log Overflow
                                          Fh = Unspecified Non-AER Correctable Error
15    Event Data 2                    PCI Bus number
16    Event Data 3                    [7:3] – PCI Device number
                                      [2:0] – PCI Function number


Table 69: PCI Express* Correctable Error Sensor Typical Characteristics


8.1.3.1 PCI Express* Correctable Error Sensor – Next Steps

This is an informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:

1. Decode the bus, device, and function to identify the card.

3. If this is an on-board device:
   a. Update all BIOS, firmware, and drivers.
   b. Replace the board.
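
This guide does not define what counts as a "low rate of occurrence", so deciding whether an error "continues" is a judgment call. One possible approach, sketched below in Python, is to group correctable error events by bus, device, and function and flag any device that logs more than a chosen number of events within a chosen time window. The window length (one hour) and the limit (five events) are arbitrary assumptions for illustration, as is the function name recurring_correctable.

```python
from collections import defaultdict


def recurring_correctable(events, window_s=3600, max_events=5):
    """Flag devices with more than max_events correctable errors in any window.

    events: iterable of (timestamp_seconds, ed2, ed3) tuples taken from
    PCI Express* Correctable Error SEL records (sensor 05h).
    Returns a list of (bus, device, function) tuples.
    """
    by_device = defaultdict(list)
    for ts, ed2, ed3 in events:
        by_device[(ed2, ed3 >> 3, ed3 & 0x07)].append(ts)

    flagged = []
    for bdf, stamps in by_device.items():
        stamps.sort()
        start = 0
        for end, ts in enumerate(stamps):
            # Slide the window start forward until it spans at most window_s.
            while ts - stamps[start] > window_s:
                start += 1
            if end - start + 1 > max_events:
                flagged.append(bdf)
                break
    return flagged


# Hypothetical log: six errors in five minutes on 03:02.0, one on 05:00.0.
sample = [(i * 60, 0x03, 0x10) for i in range(6)] + [(0, 0x05, 0x00)]
print(recurring_correctable(sample))
```

Devices returned by such a check are candidates for the steps above; the thresholds should be tuned to the platform and workload.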


9. System BIOS Events

There are a number of events that are owned by the system BIOS. These events can occur during Power On Self Test (POST) or when coming out of a sleep state. Not all of these events signify errors. Some events are described in other chapters in this document (for example, memory events).

9.1 System Events

These events can occur during POST or when coming out of a sleep state. These are informational events only.

1. Events logged during BIOS POST use Generator ID 0001h.

2. Events logged by the BIOS SMI Handler use Generator ID 0033h.

9.1.1 System Boot

At the end of POST, just before the actual OS boot occurs, a System Boot Event is logged. This marks the transition of control from completed POST to the OS Loader. It is an informational-only event.

9.1.2 Timestamp Clock Synchronization

These events are logged when the time is synchronized between the BIOS and the BMC. Two events are logged. The BIOS logs the first one when it sends the time synch message to the BMC for synchronization. The timestamp on that event is unreliable: it is the "before" timestamp, taken from the BMC clock before synchronization, so it can be anything.

The BIOS therefore sends a second time synch message to put a correct "baseline" timestamp in the log. That is the "starting time". For example, say that the time the BMC has is March 1, 2011 21:00, and the BIOS time synch updates that to 21:20 on the same date (the BMC was running behind). Without the second time synch message, you would not know that the log time jumped ahead, and the next log message would appear to follow an unexplained 20-minute delay during the boot.

In short, without the second time synch message the time span to the next logged message is indeterminate. With the second time synch as a baseline, the log timestamps that follow are always determinate.


Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
                                      0033h = BIOS SMI Handler
11    Sensor Type                     12h = System Event
12    Sensor Number                   83h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset
                                          01h = System Boot
                                          05h = Timestamp Clock Synchronization
15    Event Data 2                    For Event Trigger Offset 05h only (Timestamp Clock Synchronization)
                                          00h = 1st in pair
                                          80h = 2nd in pair
16    Event Data 3                    Not used


The timestamp clock synchronization is run and the events are logged by the BIOS POST every time the system boots. In addition, during shutdown from some operating systems, the BIOS SMI Handler is called to run timestamp clock synchronization and log the events.

Table 70: System Event Sensor Typical Characteristics
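
The pairing of the two Timestamp Clock Synchronization events can be sketched in Python as follows. The sketch matches each "1st in pair" event (Event Data 2 = 00h) with the next "2nd in pair" event (Event Data 2 = 80h) and reports the baseline timestamp together with the apparent clock jump. It assumes, as a simplification, that both events of a pair are logged close enough together that the difference between their timestamps approximates the BMC clock adjustment. The function name clock_adjustments and the tuple layout are hypothetical.

```python
def clock_adjustments(events):
    """Pair Timestamp Clock Synchronization events from the System Event sensor.

    events: list of (timestamp_seconds, ed1, ed2) tuples in log order.
    Returns a list of (baseline_timestamp, apparent_jump_seconds) tuples.
    """
    results = []
    first_ts = None
    for ts, ed1, ed2 in events:
        if ed1 & 0x0F != 0x5:      # skip anything that is not offset 05h
            continue
        if ed2 == 0x00:            # 1st in pair: carries the "before" timestamp
            first_ts = ts
        elif ed2 == 0x80 and first_ts is not None:
            # 2nd in pair: its timestamp is the trustworthy baseline.
            results.append((ts, ts - first_ts))
            first_ts = None
    return results


# Mirrors the example above: the BMC clock was 20 minutes (1200 s) behind.
log = [(1000, 0x05, 0x00), (2200, 0x05, 0x80), (2230, 0x01, 0x00)]
print(clock_adjustments(log))
```

Intervals between later events should be measured from the baseline timestamp of the second event, not from the first.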


Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Fh = System Firmware Progress (formerly POST Error)
12    Sensor Number                   06h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 0h
15    Event Data 2                    Low Byte of POST Error Code
16    Event Data 3                    High Byte of POST Error Code


9.2 System Firmware Progress (Formerly Post Error)

The BIOS logs any POST errors to the SEL. The 2-byte POST code gets logged in the ED2 and ED3 bytes in the SEL entry. This event will be logged every time a POST error is displayed. Even though this event indicates an error, it may not be a fatal error. If this is a serious error, there will typically also be a corresponding SEL entry logged for whatever was the cause of the error – this event may contain more information about what happened than the POST error event.

Table 71: POST Error Sensor Typical Characteristics

9.2.1 System Firmware Progress (Formerly Post Error) – Next Steps

See the following table for POST Error Codes.


Error Code  Error Message                                                  Response
0012        System RTC date/time not set                                   Major
0048        Password check failed                                          Major
0140        PCI component encountered a PERR error                         Major
0141        PCI resource conflict                                          Major
0146        PCI out of resources error                                     Major
0191        Processor core/thread count mismatch detected                  Fatal
0192        Processor cache size mismatch detected                         Fatal
0194        Processor family mismatch detected                             Fatal
0195        Processor Intel(R) QPI link frequencies unable to synchronize  Fatal
0196        Processor model mismatch detected                              Fatal
0197        Processor frequencies unable to synchronize                    Fatal
5220        BIOS Settings reset to default settings                        Major
5221        Passwords cleared by jumper                                    Major
5224        Password clear jumper is Set                                   Major
8130        Processor 01 disabled                                          Major
8131        Processor 02 disabled                                          Major
8132        Processor 03 disabled                                          Major
8133        Processor 04 disabled                                          Major
8160        Processor 01 unable to apply microcode update                  Major
8161        Processor 02 unable to apply microcode update                  Major
8162        Processor 03 unable to apply microcode update                  Major
8163        Processor 04 unable to apply microcode update                  Major
8170        Processor 01 failed Self Test (BIST)                           Major
8171        Processor 02 failed Self Test (BIST)                           Major
8172        Processor 03 failed Self Test (BIST)                           Major
8173        Processor 04 failed Self Test (BIST)                           Major
8180        Processor 01 microcode update not found                        Minor
8181        Processor 02 microcode update not found                        Minor
8182        Processor 03 microcode update not found                        Minor
8183        Processor 04 microcode update not found                        Minor


Table 72: POST Error Codes
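
Because the POST error code is split across two SEL bytes (low byte in Event Data 2, high byte in Event Data 3), the bytes must be recombined before looking the code up. The Python sketch below shows the recombination and a lookup against a small subset of the POST error codes. It assumes that the four-digit codes are written in hexadecimal; the names POST_ERRORS and describe_post_error are hypothetical.

```python
# Subset of the POST Error Codes table; extend with the remaining rows as needed.
POST_ERRORS = {
    0x0012: ("System RTC date/time not set", "Major"),
    0x0191: ("Processor core/thread count mismatch detected", "Fatal"),
    0x8130: ("Processor 01 disabled", "Major"),
    0x8180: ("Processor 01 microcode update not found", "Minor"),
}


def describe_post_error(ed2: int, ed3: int) -> str:
    """Rebuild the 2-byte POST error code from ED2 (low) and ED3 (high)."""
    code = ((ed3 & 0xFF) << 8) | (ed2 & 0xFF)
    message, response = POST_ERRORS.get(code, ("Unknown POST error code", "n/a"))
    return f"{code:04X}: {message} ({response})"


# Hypothetical bytes: ED2 = 30h, ED3 = 81h gives code 8130.
print(describe_post_error(0x30, 0x81))
```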


