System Event Log Troubleshooting
Guide for EPSD Platforms Based on
Intel® Xeon® Processor E5
4600/2600/2400/1600/1400
Product Families
Intel order number G90620-002
Revision 1.1
September 2013
Enterprise Platforms and Services Division – Marketing
Revision History
Date | Revision Number | Modifications
January 2013 | 1.0 | Initial release
September 2013 | 1.1 | Added MIC Thermal Margin sensors C4 through C7. Added MIC Status sensors A2, A3, A6, and A7. Added voltage sensors EA, EB, EC, ED, and EF. Corrected typographical errors. Made corrections to Firmware Update Status table. Made corrections to Catastrophic Error Sensor table. Added support for S1400FP, S1400SP, S1600JP, and S4600LH.
Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS
GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR
SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly,
in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH,
HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS'
FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL
INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR
NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF
THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not
rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves
these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them. The information here is subject to change without notice. Do not finalize a design with this
information.
The products described in this document may contain design defects or errors known as errata which may cause the
product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may
be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.
Table of Contents
5.1 Fan Sensors
5.1.1 Fan Tachometer Sensors
5.1.2 Fan Presence and Redundancy Sensors
5.2 Temperature Sensors
5.2.1 Threshold-based Temperature Sensors
1. Introduction
The server management hardware that is part of the Intel® Server Boards and Intel® Server
Platforms serves as a vital part of the overall server management strategy. The server
management hardware provides essential information to the system administrator and provides
the administrator the ability to remotely control the server, even when the operating system is
not running.
The Intel® Server Boards and Intel® Server Platforms offer comprehensive hardware and
software-based solutions. The server management features make the servers simple to manage
and provide alerting on system events. From entry to enterprise systems, good server
management is essential to reducing the total cost of ownership.
This Troubleshooting Guide is intended to help users better understand the events that are
logged in the Baseboard Management Controller (BMC) System Event Log (SEL) on these
Intel® Server Boards.
A separate User’s Guide covers general server management and the server
management software offered on the Intel® Server Boards and Intel® Server Platforms.
Server boards currently supported by this document:
Intel® S1400FP Server Boards
Intel® S1400SP Server Boards
Intel® S1600JP Server Boards
Intel® S2400BB Server Boards
Intel® S2400EP Server Boards
Intel® S2400GP Server Boards
Intel® S2400LP Server Boards
Intel® S2400SC Server Boards
Intel® S2600CO Server Boards
Intel® S2600CP Server Boards
Intel® S2600GZ/S2600GL Server Boards
Intel® S2600IP Server Boards
Intel® S2600JF Server Boards
Intel® S2600WP Server Boards
Intel® S4600LH Server Boards
Intel® W2600CR Workstation Boards
1.1 Purpose
The purpose of this document is to list all possible events generated by the Intel platform. It may
be possible that other sources (not under our control) also generate events, which will not be
described in this document.
The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the
inventory, monitoring, logging, and recovery control functions are available independently of the
main processors, BIOS, and operating system. Platform management functions can also be
made available when the system is in a power-down state.
IPMI works by interfacing with the BMC, which extends management capabilities in the server
system and operates independently of the main processor by monitoring the on-board
instrumentation. Through the BMC, IPMI also allows administrators to control power to the
server, and remotely access BIOS configuration and operating system console information.
IPMI defines a common platform instrumentation interface to enable interoperability between:
- The baseboard management controller and chassis
- The baseboard management controller and systems management software
- Servers
IPMI enables the following:
- Common access to platform management information, consisting of:
  - Local access from systems management software
  - Remote access from LAN
  - Inter-chassis access from Intelligent Chassis Management Bus
  - Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the
    processor is down
- The IPMI interface isolates systems management software from hardware.
- Hardware advancements can be made without impacting the systems management software.
You can find more information on IPMI at the following URL:
http://www.intel.com/design/servers/ipmi
1.2.2 Baseboard Management Controller (BMC)
A baseboard management controller (BMC) is a specialized microcontroller embedded on most
Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the
intelligence behind intelligent platform management, that is, the autonomous monitoring and
recovery features implemented directly in platform management hardware and firmware.
Different types of sensors built into the computer system report to the BMC on parameters such
as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC
monitors the system for critical events by communicating with various sensors on the system
board; it sends alerts and logs events when certain parameters exceed their preset thresholds,
indicating a potential failure of the system. The administrator can also remotely communicate
with the BMC to take some corrective action such as resetting or power cycling the system to
get a hung OS running again. These abilities save on the total cost of ownership of a system.
For Intel® Server Boards and Intel® Server Platforms, the BMC supports the industry standard
IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.
1.2.2.1 System Event Log (SEL)
The BMC provides a centralized, non-volatile repository for critical, warning, and informational
system events called the System Event Log or SEL. Having the BMC manage the SEL and
logging functions helps to ensure that “post-mortem” logging information is available if a
failure occurs that disables the system processor(s).
The BMC allows access to the SEL through both in-band and out-of-band mechanisms. Various
tools and utilities can be used to access the SEL, including the Intel® SELView utility and
multiple open-source IPMI tools.
1.2.3 Intel® Intelligent Power Node Manager Version 2.0
Intel® Intelligent Power Node Manager Version 2.0 (NM) is a platform-resident technology that
enforces power and thermal policies for the platform. These policies are applied by exploiting
subsystem knobs (such as processor P and T states) that can be used to control power
consumption. Intel® Intelligent Power Node Manager enables data center power and thermal
management by exposing an external interface to management software through which platform
policies can be specified. It also enables specific data center power management usage models
such as power limiting.
The configuration and control commands are used by the external management software or
BMC to configure and control the Intel® Intelligent Power Node Manager feature. Because
Platform Services firmware does not have any external interface, external commands are first
received by the BMC over LAN and then relayed to the Platform Services firmware over IPMB
channel. The BMC acts as a relay and the transport conversion device for these commands. For
simplicity, the commands from the management console might be encapsulated in a generic
CONFIG packet format (configuration data length, configuration data blob) to the BMC so that
the BMC doesn’t even have to parse the actual configuration data.
The BMC provides the access point for remote commands from external management software and
generates alerts to it. Intel® Intelligent Power Node Manager on Intel® Manageability Engine
(Intel® ME) is an IPMI satellite controller. A mechanism exists to forward commands to Intel® ME
and then send the response back to the originator. Similarly, events from Intel® ME are sent as
alerts outside of the BMC.
2. Basic Decoding of a SEL Record
The System Event Log (SEL) record format is defined in the IPMI Specification. The following section provides a basic definition for
each of the fields in a SEL record. For more details, see the IPMI Specification.
The definitions for the standard SEL can be found in Table 1.
The definitions for the OEM defined event logs can be found in Table 3 and Table 4.
2.1 Default Values in the SEL Records
Unless otherwise noted in the event record descriptions, the following are the default values in all SEL entries.
Byte [3] = Record Type (RT) = 02h = System event record
Byte [9:8] = Generator ID = 0020h = BMC Firmware
Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0
Table 1. SEL Record Format
Bytes 1-2: Record ID (RID)
  ID used for SEL Record access.
Byte 3: Record Type (RT)
  [7:0] – Record Type
    02h = System event record
    C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
    E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)
Bytes 4-7: Timestamp (TS)
  Time when event was logged. LS byte first.
  Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
  Note: There are various websites that will convert the raw number to a date/time.
Bytes 8-9: Generator ID (GID)
  RqSA and LUN if event was generated from IPMB.
  Software ID if event was generated from system software.
  Byte 1
    [7:1] – 7-bit I2C Slave Address, or 7-bit system software ID
    [0] – 0b = ID is IPMB Slave Address
          1b = System software ID
    Software ID values:
      0001h – BIOS POST for POST errors, RAS Configuration/State, Timestamp Synch, OS Boot events
      0033h – BIOS SMI Handler
      0020h – BMC Firmware
      002Ch – ME Firmware
      0041h – Server Management Software
      00C0h – HSC Firmware – HSBP A
      00C2h – HSC Firmware – HSBP B
  Byte 2
    [7:4] – Channel number. Channel that event message was received over. 0h if the event
            message was received from the system interface, primary IPMB, or internally generated
            by the BMC.
    [3:2] – Reserved. Write as 00b.
    [1:0] – IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.
Byte 10: Event Message Revision (ER)
  04h = IPMI 2.0 (see section 2.1)
Byte 11: Sensor Type (ST)
  Sensor Type Code for sensor that generated the event
Byte 12: Sensor # (SN)
  Number of sensor that generated the event (From SDR)
Byte 13: Event Dir | Event Type (EDIR)
  Event Dir
    [7] – 0b = Assertion event.
          1b = Deassertion event.
  Event Type
    Type of trigger for the event, for example, critical threshold going high, state asserted,
    and so on. Also indicates class of the event. For example, discrete, threshold, or OEM.
    The Event Type field is encoded using the Event/Reading Type Code.
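The byte layout in Table 1 can be illustrated with a short Python sketch. This is only a minimal illustration, not a replacement for a SEL utility: the function names are hypothetical, it handles only record type 02h, and it assumes the record is already available as 16 raw bytes with bytes 14 through 16 holding Event Data 1 through Event Data 3.

```python
import struct
from datetime import datetime, timezone


def decode_timestamp(ts_bytes: bytes) -> datetime:
    """Decode the 4-byte SEL timestamp (LS byte first) as a UTC date/time."""
    (seconds,) = struct.unpack("<I", ts_bytes)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def parse_system_event_record(raw: bytes) -> dict:
    """Split a 16-byte record of type 02h into the fields listed in Table 1."""
    if len(raw) != 16:
        raise ValueError("expected a 16-byte SEL record")
    if raw[2] != 0x02:
        raise ValueError("not a system event record (Record Type != 02h)")
    return {
        "record_id": int.from_bytes(raw[0:2], "little"),     # bytes 1-2
        "record_type": raw[2],                               # byte 3
        "timestamp": decode_timestamp(raw[3:7]),             # bytes 4-7
        "generator_id": int.from_bytes(raw[7:9], "little"),  # bytes 8-9
        "evm_rev": raw[9],                                   # byte 10
        "sensor_type": raw[10],                              # byte 11
        "sensor_number": raw[11],                            # byte 12
        "event_dir_type": raw[12],                           # byte 13
        "event_data": tuple(raw[13:16]),                     # bytes 14-16 (assumed)
    }
```

For example, `decode_timestamp(bytes([0x29, 0x76, 0x68, 0x4C]))` reproduces the Table 1 example and yields 15 Aug 2010 23:20:09 UTC.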
Table 2: Event Request Message Event Data Field Contents
Sensor Class: Threshold
  Event Data 1
    [7:6] – 00b = Unspecified Event Data 2
            01b = Trigger reading in Event Data 2
            10b = OEM code in Event Data 2
            11b = Sensor-specific event extension code in Event Data 2
    [5:4] – 00b = Unspecified Event Data 3
            01b = Trigger threshold value in Event Data 3
            10b = OEM code in Event Data 3
            11b = Sensor-specific event extension code in Event Data 3
    [3:0] – Offset from Event/Reading Code for threshold event.
  Event Data 2 – Reading that triggered event, FFh or not present if unspecified.
  Event Data 3 – Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present.
Sensor Class: Discrete
  Event Data 1
    [7:6] – 00b = Unspecified Event Data 2
            01b = Previous state and/or severity in Event Data 2
            10b = OEM code in Event Data 2
            11b = Sensor-specific event extension code in Event Data 2
    [5:4] – 00b = Unspecified Event Data 3
            01b = Reserved
            10b = OEM code in Event Data 3
            11b = Sensor-specific event extension code in Event Data 3
    [3:0] – Offset from Event/Reading Code for discrete event state
  Event Data 2
    [7:4] – Optional offset from “Severity” Event/Reading Code (0Fh if unspecified).
    [3:0] – Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).
  Event Data 3 – Optional OEM code. FFh or not present if unspecified.
Sensor Class: OEM
  Event Data 1
    [7:6] – 00b = Unspecified Event Data 2
            01b = Previous state and/or severity in Event Data 2
            10b = OEM code in Event Data 2
            11b = Reserved
    [5:4] – 00b = Unspecified Event Data 3
            01b = Reserved
            10b = OEM code in Event Data 3
            11b = Reserved
    [3:0] – Offset from Event/Reading Type Code
  Event Data 2
    [7:4] – Optional OEM code bits or offset from “Severity” Event/Reading Type Code (0Fh if unspecified).
    [3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).
  Event Data 3 – Optional OEM code. FFh or not present if unspecified.
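The bit fields of Event Data 1 in Table 2 can be split with a few shifts and masks. The following Python sketch is illustrative only; the function name, the dictionary names, and the lower-case class labels ("threshold", "discrete", "oem") are assumptions made for this example.

```python
# Meaning of Event Data 1 bits [7:6] (content of Event Data 2), per Table 2.
ED2_MEANING = {
    "threshold": ("unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"),
    "oem": ("unspecified", "previous state and/or severity", "OEM code",
            "reserved"),
}

# Meaning of Event Data 1 bits [5:4] (content of Event Data 3), per Table 2.
ED3_MEANING = {
    "threshold": ("unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"),
    "oem": ("unspecified", "reserved", "OEM code", "reserved"),
}


def decode_event_data1(ed1: int, sensor_class: str) -> dict:
    """Split Event Data 1 into its three bit fields for the given sensor class."""
    return {
        "event_data2": ED2_MEANING[sensor_class][(ed1 >> 6) & 0x03],
        "event_data3": ED3_MEANING[sensor_class][(ed1 >> 4) & 0x03],
        "offset": ed1 & 0x0F,
    }
```

For instance, `decode_event_data1(0x52, "threshold")` reports a trigger reading in Event Data 2, a trigger threshold value in Event Data 3, and offset 2.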
Table 3: OEM SEL Record (Type C0h-DFh)
Bytes 1-2: Record ID (RID)
  ID used for SEL Record access.
Byte 3: Record Type (RT)
  [7:0] – Record Type
    C0h-DFh = OEM timestamped, bytes 8-16 OEM defined
Bytes 4-7: Timestamp (TS)
  Time when event was logged. LS byte first.
  Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
  Note: There are various websites that will convert the raw number to a date/time.
Bytes 8-10: Manufacturer ID
  LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA
  “Private Enterprise” ID.
  Most significant four bits = Reserved (0000b).
  000000h = Unspecified. 0FFFFFh = Reserved.
  This value is binary encoded.
  For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be
  stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.
Bytes 11-16: OEM Defined
  OEM Defined. This is defined according to the manufacturer identified by the
  Manufacturer ID field.
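The Manufacturer ID byte order described in Table 3 can be checked with a short Python sketch. The function name is hypothetical; the input is assumed to be bytes 8 through 10 of an OEM timestamped record.

```python
def decode_manufacturer_id(id_bytes: bytes) -> int:
    """Return the 20-bit manufacturer ID stored LS byte first in bytes 8-10."""
    if len(id_bytes) != 3:
        raise ValueError("the Manufacturer ID field is 3 bytes long")
    value = int.from_bytes(id_bytes, "little")
    # The most significant four bits of the 24-bit field are reserved.
    return value & 0x0FFFFF
```

With the example from the table, `decode_manufacturer_id(bytes([0xF2, 0x1B, 0x00]))` returns 7154 (1BF2h).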
Table 4: OEM SEL Record (Type E0h-FFh)
Bytes 1-2: Record ID (RID)
  ID used for SEL Record access.
Byte 3: Record Type (RT)
  [7:0] – Record Type
    E0h-FFh = OEM system event record
Bytes 4-16: OEM
  OEM Defined. This is defined by the system integrator.
2.2 Notes on SEL Logs and Collecting SEL Information
Whenever you capture the SEL, always collect both the text (human-readable) version and the hex version. Because
some of the data is OEM-specific, some utilities cannot decode the information correctly. In addition, some OEM-specific data
contains additional variables that are not decoded at all.
An example of information that is not fully decoded is the BIOS timestamp synchronization event. This event can be logged by the
BIOS during POST, or it can be logged by the BIOS SMI Handler when the operating system (OS) requests a shutdown or a restart.
See section 2.2.1 for examples. Most utilities report this as just a BIOS event and do not differentiate
between the two sources. The distinction is sometimes useful because it makes the sequence of events clearer. For example, if there are multiple
sequences of timestamp synchronization events, the source of each event helps to determine whether power was lost after booting to the OS
and the system then restarted, whether there were multiple POST events, or whether it was a restart from the OS.
Another example of information that is not decoded is found in the PCI Express* errors and some of the Power Supply events. For the PCI
Express* errors, the type of error and the PCI Bus, Device, and Function are all a part of Event Data 1 through Event Data 3. See
section 2.2.2. For the Power Supply events, when there is a failure, predictive failure, or a configuration error, Event Data 2 and Event
Data 3 hold additional information that describes the power supply's PMBus* Command Registers and values for that particular
event. See section 2.2.3.
2.2.1 Examples of Decoding BIOS Timestamp Events
The following are some samples of BIOS timestamp events during POST and during an OS shutdown.
RID (Record ID) = 001Fh
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C3h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
[6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair
RID (Record ID) = 0020h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C4h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
[6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair
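The Generator ID and Event Direction/Event Type fields in the records above can be decoded as in the following Python sketch. It is a minimal illustration: the function and dictionary names are hypothetical, and the name lookup covers only the Generator ID values listed in Table 1.

```python
# Generator ID values listed in Table 1 of this guide.
GENERATOR_NAMES = {
    0x0001: "BIOS POST",
    0x0033: "BIOS SMI Handler",
    0x0020: "BMC Firmware",
    0x002C: "ME Firmware",
    0x0041: "Server Management Software",
    0x00C0: "HSC Firmware - HSBP A",
    0x00C2: "HSC Firmware - HSBP B",
}


def decode_generator_id(gid: int) -> dict:
    """Split the 16-bit Generator ID into the bit fields described in Table 1."""
    byte1 = gid & 0xFF          # record byte 8
    byte2 = (gid >> 8) & 0xFF   # record byte 9
    return {
        "is_software_id": bool(byte1 & 0x01),  # [0]: 1b = system software ID
        "id_7bit": byte1 >> 1,                 # [7:1]: slave address or software ID
        "channel": byte2 >> 4,                 # [7:4]: channel number
        "lun": byte2 & 0x03,                   # [1:0]: IPMB device LUN
        "name": GENERATOR_NAMES.get(gid, "unknown"),
    }


def decode_event_dir_type(edir: int) -> dict:
    """Split byte 13 into event direction ([7]) and event type ([6:0])."""
    return {"deassertion": bool(edir & 0x80), "event_type": edir & 0x7F}
```

Applied to the examples above, `decode_generator_id(0x0033)` identifies a system software ID named "BIOS SMI Handler", and `decode_event_dir_type(0x6F)` reports an assertion with event type 6Fh.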
2.2.2 Example of Decoding a PCI Express* Correctable Error Event
The following is an example of decoding a PCI Express* correctable error event. This particular event recorded a receiver error
on Bus 0, Device 2, Function 2. Note that correctable errors are acceptable and normal at a low rate of occurrence.
RT (Record Type) = 02h = system event record
TS (Timestamp) = 502E9B0Ah
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 13h = Critical Interrupt (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 05h
EDIR (Event Direction/Event Type) = 71h; [7] = 0 = Assertion Event
[6:0] = 71h = OEM Specific for PCI Express* correctable errors
ED1 (Event Data 1) = A0h; [7:6] = 10b = OEM code in Event Data 2
[5:4] = 10b = OEM code in Event Data 3
[3:0] = Event Trigger Offset = 0h = Receiver Error
ED2 (Event Data 2) = 00h; PCI Bus number = 0
ED3 (Event Data 3) = 12h; [7:3] = PCI Device number = 02h
[2:0] = PCI Function number = 2
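The Bus, Device, and Function values in this example follow directly from Event Data 2 and Event Data 3. A minimal Python sketch of that arithmetic is shown here; the function name is hypothetical.

```python
def decode_pcie_location(ed2: int, ed3: int) -> dict:
    """Extract the PCI bus, device, and function from Event Data 2 and 3."""
    return {
        "bus": ed2,              # Event Data 2: PCI Bus number
        "device": ed3 >> 3,      # Event Data 3 [7:3]: PCI Device number
        "function": ed3 & 0x07,  # Event Data 3 [2:0]: PCI Function number
    }
```

For the record above, `decode_pcie_location(0x00, 0x12)` gives bus 0, device 2, function 2.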
2.2.3 Example of Decoding a Power Supply Predictive Failure Event
The following is an example of decoding a Power Supply predictive failure event. In this example, power supply 1 saw an A/C power
loss event in which both the input under-voltage warning and fault events were set. In most cases this means that the A/C power dipped
below the minimum warning and fault thresholds for more than 20 milliseconds while the system remained powered on. If these events
continue to occur, it is advisable to check your power source.
Table 105: Linux* Kernel Panic Event Record Characteristics
Not applicable
F0h
Not applicable
Table 106: Linux* Kernel Panic String Extended Record Characteristics
3.5 Microsoft* OS owned Events (GID = 0041)
The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).
Table 9: Microsoft* OS owned Events
3.6 Linux* Kernel Panic Events (GID = 0021)
The following table can be used to find the details of records that can be generated when there is a Linux* Kernel panic.
Table 10: Linux* Kernel Panic Events
4. Power Subsystems
The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.
4.1 Threshold-based Voltage Sensors
The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI-compliant
analog/threshold sensors. Some voltages are only present on specific platforms. For details, check your platform's Technical Product Specification (TPS).
Note: A voltage error can be caused by the device supplying the voltage or by the device using the voltage. For each sensor, the
device supplying the voltage and the device using it are noted.
Table 11: Threshold-based Voltage Sensors Typical Characteristics
Byte 11: Sensor Type
  02h = Voltage
Byte 12: Sensor Number
  See Table 13
Byte 13: Event Direction and Event Type
  [7] Event direction
    0b = Assertion Event
    1b = Deassertion Event
  [6:0] Event Type = 01h (Threshold)
Byte 14: Event Data 1
  [7:6] – 01b = Trigger reading in Event Data 2
  [5:4] – 01b = Trigger threshold in Event Data 3
  [3:0] – Event Triggers as described in Table 12
Byte 15: Event Data 2
  Reading that triggered event
Byte 16: Event Data 3
  Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Event Trigger (Hex) | Event Trigger (Description) | Assertion Severity | Deassert Severity | Description
00h | Lower non-critical going low | Degraded | OK | The voltage has dropped below its lower non-critical threshold.
02h | Lower critical going low | Non-fatal | Degraded | The voltage has dropped below its lower critical threshold.
07h | Upper non-critical going high | Degraded | OK | The voltage has gone over its upper non-critical threshold.
09h | Upper critical going high | Non-fatal | Degraded | The voltage has gone over its upper critical threshold.
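The event trigger table above can be turned into a small lookup for interpreting a voltage sensor event. The following Python sketch is illustrative only: the function and dictionary names are hypothetical, and it does not convert the raw reading and threshold in Event Data 2 and Event Data 3 into volts, since that conversion depends on the sensor's SDR.

```python
# Event trigger offset -> (description, assertion severity, deassertion severity),
# taken from the event trigger table for threshold-based voltage sensors.
VOLTAGE_TRIGGERS = {
    0x00: ("Lower non-critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "Non-fatal", "Degraded"),
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "Non-fatal", "Degraded"),
}


def describe_voltage_event(edir: int, ed1: int) -> dict:
    """Interpret byte 13 (EDIR) and byte 14 (Event Data 1) of a voltage event."""
    deassertion = bool(edir & 0x80)   # [7]: 1b = Deassertion Event
    offset = ed1 & 0x0F               # [3:0]: event trigger offset
    trigger, assert_sev, deassert_sev = VOLTAGE_TRIGGERS[offset]
    return {
        "trigger": trigger,
        "direction": "deassertion" if deassertion else "assertion",
        "severity": deassert_sev if deassertion else assert_sev,
    }
```

For example, `describe_voltage_event(0x01, 0x52)` describes an assertion of "Lower critical going low" with Non-fatal severity.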
This 1.05V line is supplied by the main board.
This 1.05V line is used by processor 2.
1. Ensure all cables are connected correctly.
2. Check that the processor is seated properly.
3. Cross test the processors. If the issue remains with the processor socket, replace the main board; otherwise replace the processor.
D8h: Baseboard +1.5V P1 Memory AB VDDQ (BB +1.5 P1MEM AB)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 1 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
D9h: Baseboard +1.5V P1 Memory CD VDDQ (BB +1.5 P1MEM CD)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 1 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
Sensor Number | Sensor Name | Next Steps
DAh: Baseboard +1.5V P2 Memory AB VDDQ (BB +1.5 P2MEM AB)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 2 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
DBh: Baseboard +1.5V P2 Memory CD VDDQ (BB +1.5 P2MEM CD)
This 1.5V line is supplied by the main board.
This 1.5V line is used by processor 2 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
DCh: Baseboard +1.8V Aux (BB +1.8V AUX)
+1.8V AUX is supplied by the main board.
+1.8V AUX is used by the BMC and on-board NIC.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
DDh: Baseboard +1.1V Stand-by (BB +1.1V STBY)
+1.1V STBY is supplied by the main board.
+1.1V STBY is used by the Intel® C600 series Chipset.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
DEh: Baseboard CMOS Battery (BB +3.3V Vbat)
+3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on.
+3.3V Vbat is used by the CMOS and related circuits.
1. Replace the CMOS battery. Any battery of type CR2032 can be used.
2. If the error remains (unlikely), replace the board.
E4h: Baseboard +1.35V P1 Low Voltage Memory AB VDDQ (BB +1.35 P1LV AB)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 1 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
E5h: Baseboard +1.35V P1 Low Voltage Memory CD VDDQ (BB +1.35 P1LV CD)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 1 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
E6h: Baseboard +1.35V P2 Low Voltage Memory AB VDDQ (BB +1.35 P2LV AB)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 2 memory slots A and B.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
E7h: Baseboard +1.35V P2 Low Voltage Memory CD VDDQ (BB +1.35 P2LV CD)
This 1.35V line is supplied by the main board.
This 1.35V line is used by processor 2 memory slots C and D.
1. Ensure all cables are connected correctly.
2. Check that the DIMMs are seated properly.
3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
EAh: Baseboard +3.3V Riser 1 Power Good (BB +3.3 RSR1 PGD)
+3.3V Riser 1 Power Good is supplied by Riser 1 on specific platforms.
+3.3V Riser 1 Power Good is an indication of the +3.3V on Riser 1.
1. Ensure that the riser is seated correctly.
2. If the issue remains, replace the riser.
3. If the issue remains, replace the main board.
4. If the issue remains, replace the power supplies.
EBh: Baseboard +3.3V Riser 2 Power Good (BB +3.3 RSR2 PGD)
+3.3V Riser 2 Power Good is supplied by Riser 2 on specific platforms.
+3.3V Riser 2 Power Good is an indication of the +3.3V on Riser 2.
1. Ensure that the riser is seated correctly.
2. If the issue remains, replace the riser.
3. If the issue remains, replace the main board.
4. If the issue remains, replace the power supplies.
ECh: Baseboard +0.9V (BB 0.9V Core IB)
+0.9V Core IB is supplied by the main board on specific platforms.
+0.9V Core IB is used by the on-board InfiniBand* controller on those specific platforms.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
EDh: Baseboard +1.8V (BB 1.8V IB I/O)
+1.8V IB I/O is supplied by the main board on specific platforms.
+1.8V IB I/O is used by the on-board InfiniBand* controller on those specific platforms.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
EEh: Baseboard +1.1V (BB 1.1V PCH)
This 1.1V line is supplied by the main board.
This 1.1V line is used by the Intel® C600 series Chipset.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
EFh: Baseboard +1.2V (BB +1.2V IB)
+1.2V is supplied by the main board on specific platforms.
+1.2V is used by the on-board InfiniBand* controller on those specific platforms.
1. Ensure all cables are connected correctly.
2. If the issue remains, replace the board.
3. If the issue remains, replace the power supplies.
4.2 Voltage Regulator Watchdog Timer Sensor
The BMC FW verifies that the power sequence for the board VR controllers completes when a DC power-on is initiated.
Failure to complete the sequence indicates a board problem, in which case the FW powers down the system.
The sequence is as follows:
BMC FW monitors the PowerSupplyPowerGood signal for assertion, indicating a DC-power-on has been initiated, and starts a
timer (VR Watchdog Timer). For EPSD Platforms Based on Intel® Xeon® Processor E5 4600/2600/2400/1600 Product Families
this timeout is 500ms.
If the SystemPowerGood signal has not asserted by the time the VR Watchdog Timer expires, the FW powers down the system,
logs a SEL entry, and emits a beep code (1-5-1-2). This failure is termed a VR Watchdog Timeout.
Byte  Field                           Description
11    Sensor Type                     02h = Voltage
12    Sensor Number                   0Bh
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (“digital” Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 1h = State Asserted
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Table 14: Voltage Regulator Watchdog Timer Sensor Typical Characteristics
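The timeout check in the power-on sequence for this sensor can be sketched in a few lines. This is a minimal illustration, not the BMC firmware implementation: the function name and the millisecond timestamps are hypothetical, and treating an assertion that arrives exactly at the 500 ms boundary as on time is an assumption made for the example.

```python
# Timeout stated in this guide for the VR Watchdog Timer on these platforms.
VR_WATCHDOG_TIMEOUT_MS = 500


def vr_watchdog_timed_out(ps_power_good_ms, sys_power_good_ms,
                          timeout_ms=VR_WATCHDOG_TIMEOUT_MS):
    """Return True if SystemPowerGood did not assert before the timer expired.

    ps_power_good_ms:  time at which PowerSupplyPowerGood asserted (timer start).
    sys_power_good_ms: time at which SystemPowerGood asserted, or None if it
                       never asserted.
    """
    if sys_power_good_ms is None:
        return True
    # Assumption: an assertion exactly at the boundary counts as on time.
    return (sys_power_good_ms - ps_power_good_ms) > timeout_ms


if __name__ == "__main__":
    print(vr_watchdog_timed_out(0, 120))   # sequence completed in time
    print(vr_watchdog_timed_out(0, None))  # SystemPowerGood never asserted
```

When the function returns True, the guide's described outcome applies: the FW powers down the system, logs the SEL entry, and emits the 1-5-1-2 beep code.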
4.2.1 Voltage Regulator Watchdog Timer Sensor – Next Steps
1. Ensure that all the connectors from the power supply are well seated.
2. Cross test the baseboard. If the issue remains with the baseboard, replace the baseboard.
4.3 Power Unit
The power unit monitors the power state of the system and logs the state changes in the SEL.
4.3.1 Power Unit Status Sensor
The power unit status sensor monitors the power state of the system and logs state changes. Expected power events, such as DC
ON/OFF, are logged, as are unexpected events, such as AC loss and power good loss.
Byte  Field                           Description
11    Sensor Type                     09h = Power Unit
12    Sensor Number                   01h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] = Sensor Specific offset as described in Table 16
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Sensor Specific Offset
Description
Next Steps
Hex
Description
00h
Power down
System is powered down.
Informational Event
02h
240 VA power down
240 VA power limit was exceeded and the
hardware forced a power down.
This could have been caused by many things.
1. If you recently added hardware, try removing it.
2. Remove/replace any add-in adapters.
3. Remove/replace the power supply.
4. Remove/replace the processors, DIMM, and/or hard drives.
5. Remove/replace the boards in the system.
04h
A/C Lost
A/C power was removed.
Informational Event
Table 15: Power Unit Status Sensors Typical Characteristics
Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
Sensor Specific Offset
Description
Next Steps
Hex
Description
05h
Soft Power Control
Failure
Generally means power good was lost in
the system, causing a shutdown.
This could be caused by the power supply subsystem or system
components.
1. Verify all power cables and adapters are connected properly (AC
cables as well as the cables between the PSU and system
components).
2. Cross test the PSU if possible.
3. Replace the power subsystem.
06h
Power Unit Failure
Power subsystem experienced a failure.
Indicates a power supply failed.
1. Remove and reapply AC power.
2. If the power supply still fails, replace it.
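To illustrate how bytes 13 and 14 of a Power Unit Status record map onto Table 15 and Table 16, the following sketch splits the bit fields and looks up the sensor-specific offset. The function and dictionary names are hypothetical; only the bit positions and offset values are taken from the tables.

```python
# Sensor-specific offsets listed in Table 16 (Power Unit Status sensor).
POWER_UNIT_STATUS_OFFSETS = {
    0x00: "Power down",
    0x02: "240 VA power down",
    0x04: "A/C Lost",
    0x05: "Soft Power Control Failure",
    0x06: "Power Unit Failure",
}


def decode_power_unit_status(byte13, byte14):
    """Decode SEL record bytes 13 and 14 for the Power Unit Status sensor."""
    offset = byte14 & 0x0F  # Event Data 1 bits [3:0]
    return {
        # Byte 13 bit [7]: 0b = assertion, 1b = deassertion.
        "direction": "Deassertion" if byte13 & 0x80 else "Assertion",
        # Byte 13 bits [6:0]: event type (6Fh = sensor specific).
        "event_type": byte13 & 0x7F,
        "offset": offset,
        "description": POWER_UNIT_STATUS_OFFSETS.get(
            offset, "Not listed in Table 16"),
    }


if __name__ == "__main__":
    print(decode_power_unit_status(0x6F, 0x04))
```

The same bit split for byte 13 applies to every sensor in this guide; only the event type value and the offset table differ from sensor to sensor.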
Byte  Field                           Description
11    Sensor Type                     09h = Power Unit
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 18
15    Event Data 2                    Not used
16    Event Data 3                    Not used
4.3.2 Power Unit Redundancy Sensor
This sensor is enabled on systems that support redundant power supplies. When AC is applied to the system, or when the system loses
power supply redundancy, a message is logged in the SEL.
Table 17: Power Unit Redundancy Sensors Typical Characteristics
Event Trigger Offset
Description
Next Steps
Hex
Description
00h
Fully redundant
System is fully operational.
Informational Event
01h
Redundancy lost
System is not running in
redundant power supply mode.
This event is accompanied by specific power supply errors
(AC lost, PSU failure, and so on). Troubleshoot these events
accordingly.
02h
Redundancy degraded
03h
Non-redundant, sufficient from redundant
04h
Non-redundant, sufficient from insufficient
05h
Non-redundant, insufficient
06h
Non-redundant, degraded from fully redundant
07h
Redundant, degraded from non-redundant
Byte
Field
Description
11
Sensor Type
09h = Power Unit
12
Sensor Number
B8h
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
4.3.3 Node Auto Shutdown Sensor
The BMC supports a Node Auto Shutdown sensor that logs a SEL event when a node undergoes an emergency shutdown caused by loss of
power supply redundancy or by PSU CLST throttling triggered by an over-current warning condition. This sensor is applicable only to multi-node systems.
The sensor is rearmed on power-on (AC or DC power on transitions).
This sensor is only used for triggering SEL to indicate node or power auto shutdown assertion or deassertion.
Table 19: Node Auto Shutdown Sensor Typical Characteristics
50h = Power Supply 1 Status
51h = Power Supply 2 Status
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
4.3.3.1 Node Auto Shutdown Sensor – Next Steps
This event is accompanied by specific power supply errors (AC lost, PSU failure, and so on) or other system events. Troubleshoot
these events accordingly.
4.4 Power Supply
The BMC monitors the power supply subsystem.
4.4.1 Power Supply Status Sensors
These sensors report the status of the power supplies in the system. When AC is first applied to or removed from the system, the sensor can log an event.
It can also log an event if there is a failure, predictive failure, or configuration error.
Table 20: Power Supply Status Sensors Typical Characteristics
Byte
Field
Description
[6:0] Event Type = 6Fh (Sensor Specific)
14
Event Data 1
[7:6] – ED2 data in Table 21
[5:4] – ED3 data in Table 21
[3:0] – Sensor Specific offset as described in Table 21
15
Event Data 2
As described in Table 21
16
Event Data 3
As described in Table 21
Sensor Specific Offset
Description
ED2
ED3
Next Steps
Hex
Description
00h
Presence
Power supply detected
00b = Unspecified Event Data 2
00b = Unspecified Event Data 3
Informational Event
01h
Failure
Power supply failed
Check the data in ED2
and ED3 for more details.
10b = OEM code in Event Data 2
01h – Output voltage fault
02h – Output power fault
03h – Output over-current fault
04h – Over-temperature fault
05h – Fan fault
10b = OEM code in Event Data 3
Will have the contents of the
associated PMBus* Status
register. For example, Data 3 will
have the contents of the
VOLTAGE_STATUS register at
the time an Output Voltage fault
was detected. Refer to the
PMBus* Specification for details
on specific register contents.
Indicates a power supply
failed.
1. Remove and reapply
AC.
2. If the power supply
still fails, replace it.
Table 21: Power Supply Status Sensor – Sensor Specific Offsets – Next Steps
Sensor Specific Offset
Description
ED2
ED3
Next Steps
Hex
Description
02h
Predictive
Failure
Check the data in ED2
and ED3 for more details.
10b = OEM code in Event Data 2
01h – Output voltage warning
02h – Output power warning
03h – Output over-current
Will have the contents of the
associated PMBus* Status
register. For example, Data 3 will
have the contents of the
VOLTAGE_STATUS register at
the time an Output Voltage
warning was detected. Refer to
the PMBus* Specification for
details on specific register
contents
Depends on the warning
event.
1. Replace the power
supply.
2. Verify proper airflow
to the system.
3. Verify the power
source.
4. Replace the system
boards.
03h
A/C lost
AC removed
00b = Unspecified Event Data 2
00b = Unspecified Event Data 3
Informational Event.
06h
Configuration
error
Power supply
configuration is not
supported.
Check the data in ED2 for
more details.
10b = OEM code in Event Data 2
01h – The BMC cannot access
the PMBus* device on the PSU
but its FRU device is
responding.
02h – The PMBUS*_REVISION
command returns a version
number that is not supported
(only version 1.1 and 1.2 are
supported).
03h – The PMBus* device does
not successfully respond to the
PMBUS*_REVISION command.
04h – The PSU is incompatible
with one or more PSUs that are
present in the system.
05h – The PSU FW is operating
in a degraded mode (likely due
to a failed firmware update).
00b = Unspecified Event Data 3
Indicates that at least one of
the supplies is not correct for
your system configuration.
1. Remove the power
supply and verify
compatibility.
2. If the power supply is
compatible, it may be
faulty. Replace it.
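The Event Data fields in Table 21 can be decoded as in the following sketch. It is a hedged illustration only: the function and dictionary names are hypothetical, and only the offsets and ED2 codes that Table 21 lists are included. The ED3 value, when flagged as an OEM code, is returned as the raw PMBus status register byte without further interpretation.

```python
# Sensor-specific offsets from Table 21.
PSU_STATUS_OFFSETS = {
    0x00: "Presence",
    0x01: "Failure",
    0x02: "Predictive Failure",
    0x03: "A/C lost",
    0x06: "Configuration error",
}

# ED2 codes for offset 01h (Failure).
FAILURE_ED2 = {
    0x01: "Output voltage fault",
    0x02: "Output power fault",
    0x03: "Output over-current fault",
    0x04: "Over-temperature fault",
    0x05: "Fan fault",
}

# ED2 codes for offset 06h (Configuration error).
CONFIG_ERROR_ED2 = {
    0x01: "BMC cannot access the PMBus device on the PSU, but its FRU device responds",
    0x02: "Unsupported PMBus revision (only 1.1 and 1.2 are supported)",
    0x03: "PMBus device did not respond to the PMBUS_REVISION command",
    0x04: "PSU is incompatible with one or more PSUs present in the system",
    0x05: "PSU FW is operating in a degraded mode",
}


def decode_psu_status(ed1, ed2, ed3):
    """Decode Event Data 1-3 of a Power Supply Status event."""
    offset = ed1 & 0x0F
    ed2_is_oem = ((ed1 >> 6) & 0x3) == 0b10  # ED1 bits [7:6]
    ed3_is_oem = ((ed1 >> 4) & 0x3) == 0b10  # ED1 bits [5:4]

    detail = None
    if ed2_is_oem and offset == 0x01:
        detail = FAILURE_ED2.get(ed2)
    elif ed2_is_oem and offset == 0x06:
        detail = CONFIG_ERROR_ED2.get(ed2)

    return {
        "event": PSU_STATUS_OFFSETS.get(offset, "Not listed in Table 21"),
        "detail": detail,
        # Raw PMBus status register contents, only when ED3 holds an OEM code.
        "pmbus_status": ed3 if ed3_is_oem else None,
    }


if __name__ == "__main__":
    print(decode_psu_status(0xA1, 0x03, 0x10))
```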
Byte  Field                           Description
11    Sensor Type                     0Bh = Other Units
12    Sensor Number                   54h = Power Supply 1 Status
                                      55h = Power Supply 2 Status
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 23
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
Event Trigger Offset
Assertion
Severity
Deassert
Severity
Description
Next Steps
Hex
Description
07h
Upper non-critical
going high
Degraded
OK
PMBus* feature to monitor power
supply power consumption.
If you see this event, the system is pulling too much power on the
input for the PSU rating.
1. Verify the power budget is within the specified range.
2. Check http://www.intel.com/p/en_US/support/ for the
power budget tool for your system.
09h
Upper critical
going high
non-fatal
Degraded
4.4.2 Power Supply Power In Sensors
These sensors will log an event when a power supply in the system is exceeding its AC power in threshold.
Table 22: Power Supply Power In Sensors Typical Characteristics
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 23: Power Supply Power In Sensor – Event Trigger Offset – Next Steps
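Threshold events such as those in Table 22 and Table 23 carry the trigger reading and the threshold in Event Data 2 and Event Data 3. The following sketch shows one way to unpack them. The function name is hypothetical, and the values are returned as raw bytes: converting them to engineering units depends on the sensor's SDR, which this sketch does not attempt.

```python
# Offsets and severities from Table 23:
# offset -> (description, assertion severity, deassertion severity)
THRESHOLD_OFFSETS = {
    0x07: ("Upper non-critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}


def decode_threshold_event(byte13, ed1, ed2, ed3):
    """Decode a threshold-type (event type 01h) SEL event."""
    asserted = (byte13 & 0x80) == 0       # byte 13 bit [7]
    offset = ed1 & 0x0F                   # ED1 bits [3:0]
    name, assert_sev, deassert_sev = THRESHOLD_OFFSETS.get(
        offset, ("Not listed in Table 23", None, None))
    has_reading = ((ed1 >> 6) & 0x3) == 0b01    # ED1 bits [7:6]
    has_threshold = ((ed1 >> 4) & 0x3) == 0b01  # ED1 bits [5:4]
    return {
        "event": name,
        "severity": assert_sev if asserted else deassert_sev,
        "raw_reading": ed2 if has_reading else None,
        "raw_threshold": ed3 if has_threshold else None,
    }


if __name__ == "__main__":
    print(decode_threshold_event(0x01, 0x57, 0xC8, 0xC0))
```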
Byte  Field                           Description
11    Sensor Type                     03h = Current
12    Sensor Number                   58h = Power Supply 1 Current Out %
                                      59h = Power Supply 2 Current Out %
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 25
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
Event Trigger Offset
Assertion
Severity
Deassert
Severity
Description
Next Steps
Hex
Description
07h
Upper non-critical
going high
Degraded
OK
PMBus* feature to monitor power
supply power consumption.
If you see this event, the system is using too much power on the
output for the PSU rating.
1. Verify the power budget is within the specified range.
2. Check http://www.intel.com/p/en_US/support/ for the
power budget tool for your system.
09h
Upper critical
going high
non-fatal
Degraded
4.4.3 Power Supply Current Out % Sensors
PMBus*-compliant power supplies may monitor the current output of the main 12V voltage rail and report the current usage as a
percentage of the maximum power output for that rail.
Table 24: Power Supply Current Out % Sensors Typical Characteristics
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 25: Power Supply Current Out % Sensor – Event Trigger Offset – Next Steps
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   5Ch = Power Supply 1 Temperature
                                      5Dh = Power Supply 2 Temperature
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 01h (Threshold)
14    Event Data 1                    [7:6] – 01b = Trigger reading in Event Data 2
                                      [5:4] – 01b = Trigger threshold in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 27
15    Event Data 2                    Reading that triggered event
16    Event Data 3                    Threshold value that triggered event
Event Trigger Offset
Assertion
Severity
Deassert
Severity
Description
Next Steps
Hex
Description
07h
Upper non-critical
going high
Degraded
OK
An upper non-critical or
critical temperature
threshold has been
crossed.
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal
specifications for the system (typically below 35°C).
09h
Upper critical going
high
non-fatal
Degraded
4.4.4 Power Supply Temperature Sensors
The BMC monitors one or two power supply temperature sensors for each installed PMBus*-compliant power supply.
Table 26: Power Supply Temperature Sensors Typical Characteristics
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 27: Power Supply Temperature Sensor – Event Trigger Offset – Next Steps
Byte  Field                           Description
11    Sensor Type                     04h = Fan
12    Sensor Number                   A0h = Power Supply 1 Fan Tachometer 1
                                      A1h = Power Supply 1 Fan Tachometer 2
                                      A4h = Power Supply 2 Fan Tachometer 1
                                      A5h = Power Supply 2 Fan Tachometer 2
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (“digital” Discrete)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = 1h = State Asserted
15    Event Data 2                    Not used
16    Event Data 3                    Not used
4.4.5 Power Supply Fan Tachometer Sensors
The BMC polls each installed power supply using the PMBus* fan status commands to check for failure conditions for the power
supply fans.
Table 28: Power Supply Fan Tachometer Sensors Typical Characteristics
4.4.5.1 Power Supply Fan Tachometer Sensors – Next Steps
These events only get generated in the systems with PMBus*-capable power supplies and normally when the airflow is obstructed to
the power supply:
1. Remove and then reinstall the power supply to see whether something might have temporarily caused the fan failure.
2. Swap the power supply with another one to see whether the problem stays with the location or follows the power supply.
3. Replace the power supply depending on the outcome of steps 1 and 2.
4. Ensure the latest FRUSDR update has been run and the correct chassis is detected or selected.
[7:6] – 01b = Trigger reading in Event Data 2
[5:4] – 01b = Trigger threshold in Event Data 3
[3:0] – Event Trigger Offset as described in Table 30
15
Event Data 2
Reading that triggered event
16
Event Data 3
Threshold value that triggered event
5. Cooling Subsystem
5.1 Fan Sensors
There are three types of fan sensors that can be present on Intel® Server Systems: speed, presence, and redundancy. The last two
are only present in the systems with hot-swap redundant fans.
5.1.1 Fan Tachometer Sensors
Fan tachometer sensors monitor the rpm signal on the relevant fan headers on the platform. Fan speed sensors are threshold-based
sensors. Usually they only have lower (critical) thresholds set, so that a SEL entry is only generated if the fan spins too slowly.
Table 29: Fan Tachometer Sensors Typical Characteristics
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Event Trigger Offset
Assertion
Severity
Deassert
Severity
Description
Next Steps
Hex
Description
00h
Lower non-critical
going low
Degraded
OK
The fan speed has dropped
below its lower non-critical
threshold.
A fan speed error on a new system build is typically not caused by the fan
spinning too slowly. Instead, it is caused by the fan being connected to the
wrong header (the BMC expects them on certain headers for each
chassis and will log this event if there is no fan on that header).
1. Refer to the Quick Start Guide or the Service Guide to identify
the correct fan headers to use.
2. Ensure the latest FRUSDR update has been run and the correct
chassis is detected or selected.
3. If you are sure this was done, the event may be a sign of
impending fan failure (although this only normally applies if the
system has been in use for a while). Replace the fan.
02h
Lower critical
going low
non-fatal
Degraded
The fan speed has dropped
below its lower critical
threshold.
Byte
Field
Description
11
Sensor Type
04h = Fan
12
Sensor Number
40h-4Fh (Chassis specific)
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 08h (Generic “digital” Discrete)
Table 30: Fan Tachometer Sensor – Event Trigger Offset – Next Steps
5.1.2 Fan Presence and Redundancy Sensors
Fan presence sensors are only implemented for hot-swap fans, and require an additional pin on the fan header. Fan redundancy is
an aggregate of the fan presence sensors and will warn when redundancy is lost. Typically the redundancy mode on Intel® servers is
an n+1 redundancy (if one fan fails there are still sufficient fans to cool the system, but it is no longer redundant) although other
modes are also possible.
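The n+1 idea can be illustrated with a small sketch that aggregates fan presence into a redundancy state. This is a simplified model under stated assumptions: the function name is hypothetical, the number of fans required for adequate cooling is passed in by the caller, and only three of the redundancy states named in this section are produced. The actual BMC policy is platform specific and is not reproduced here.

```python
def fan_redundancy_state(fan_present, fans_required):
    """Aggregate per-fan presence flags into a simplified redundancy state.

    fan_present:   iterable of booleans, one per fan slot (True = present).
    fans_required: assumed number of fans needed to cool the system.
    """
    working = sum(1 for present in fan_present if present)
    if working > fans_required:
        # At least one spare fan remains, so losing one is tolerated.
        return "Fully redundant"
    if working == fans_required:
        # Cooling is still adequate, but there is no spare fan left.
        return "Non-redundant, sufficient"
    return "Non-redundant, insufficient"


if __name__ == "__main__":
    # Six slots, five fans required: an n+1 configuration with one fan removed.
    print(fan_redundancy_state([True] * 5 + [False], fans_required=5))
```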
Table 31: Fan Presence Sensors Typical Characteristics
Byte
Field
Description
14
Event Data 1
[7:6] – 00b = Unspecified Event Data 2
[5:4] – 00b = Unspecified Event Data 3
[3:0] – Event Trigger Offset as described in Table 32
15
Event Data 2
Not used
16
Event Data 3
Not used
Event Trigger Offset
Assertion
Severity
Deassert
Severity
Description
Next Steps
Hex
Description
01h
Device
Present
OK
Degraded
Assertion – A fan was inserted. This
event may also get logged when the
BMC initializes when AC is applied.
Informational only
Deassert – A fan was removed, or
was not present at the expected
location when the BMC initialized.
These events only get generated in the systems with hot-swappable fans,
and normally only when a fan is physically inserted or removed. If fans
were not physically removed:
1. Use the Quick Start Guide to check whether the right fan
headers were used.
2. Swap the fans around to see whether the problem stays with the
location or follows the fan.
3. Replace the fan or fan wiring/housing depending on the outcome
of step 2.
4. Ensure the latest FRUSDR update has been run and the correct
chassis is detected or selected.
Byte
Field
Description
11
Sensor Type
04h = Fan
12
Sensor Number
0Ch
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 32: Fan Presence Sensors – Event Trigger Offset – Next Steps
Table 33: Fan Redundancy Sensors Typical Characteristics
Byte
Field
Description
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 0Bh (Generic Discrete)
14
Event Data 1
[7:6] – 00b = Unspecified Event Data 2
[5:4] – 00b = Unspecified Event Data 3
[3:0] – Event Trigger Offset as described in Table 34
15
Event Data 2
Not used
16
Event Data 3
Not used
Event Trigger Offset
Description
Next Steps
Hex
Description
00h
Fully redundant
The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system
properly cooled, but fan speeds will boost.
Fan redundancy loss indicates failure of
one or more fans.
Look for lower (non-) critical fan errors,
or fan removal errors in the SEL, to
indicate which fan is causing the
problem, and follow the troubleshooting
steps for these event types.
01h
Redundancy lost
02h
Redundancy degraded
03h
Non-redundant, sufficient from redundant
04h
Non-redundant, sufficient from insufficient
05h
Non-redundant, insufficient
The system has lost fans and may no longer be able to cool
itself adequately. Overheating may occur if this situation
remains for a longer period of time.
06h
Non-redundant, degraded from fully
redundant
The system has lost one or more fans and is running in non-redundant mode. There are enough fans to keep the system
properly cooled, but fan speeds will boost.
07h
Redundant, degraded from non-redundant
The system has lost one or more fans and is running in a
degraded mode, but still is redundant. There are enough fans
to keep the system properly cooled.
The following table describes the severity of each of the event triggers for both assertion and deassertion.
Table 34: Fan Redundancy Sensor – Event Trigger Offset – Next Steps
Byte
Field
Description
11
Sensor Type
01h = Temperature
12
Sensor Number
See Table 37
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 01h (Threshold)
14
Event Data 1
[7:6] – 01b = Trigger reading in Event Data 2
[5:4] – 01b = Trigger threshold in Event Data 3
[3:0] – Event Trigger Offset as described in Table 36
15
Event Data 2
Reading that triggered event
5.2 Temperature Sensors
There are a variety of temperature sensors that can be implemented on Intel® Server Systems. They are split into various types each
with their own events that can be logged.
Threshold-based temperature sensors are sensors that report an actual temperature. These are linear, threshold-based sensors. In
most Intel® Server Systems, multiple sensors are defined: front panel temperature and baseboard temperature. There are also
multiple other sensors that can be defined and are platform-specific. Most of these sensors typically have upper and lower thresholds
set – upper to warn in case of an over-temperature situation, lower to warn against sensor failure (temperature sensors typically read
out 0 if they stop working).
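As a sketch of how upper and lower thresholds relate to the event trigger offsets in Table 36, the following function reports which thresholds a reading is currently beyond. The function name and the threshold values in the usage example are hypothetical (real thresholds come from the SDR), and the use of strict comparisons is an assumption.

```python
def crossed_threshold_offsets(reading, lower_nc, lower_crit, upper_nc, upper_crit):
    """Return the Table 36 event trigger offsets the reading is beyond."""
    offsets = []
    if reading < lower_nc:
        offsets.append(0x00)  # Lower non-critical going low
    if reading < lower_crit:
        offsets.append(0x02)  # Lower critical going low
    if reading > upper_nc:
        offsets.append(0x07)  # Upper non-critical going high
    if reading > upper_crit:
        offsets.append(0x09)  # Upper critical going high
    return offsets


if __name__ == "__main__":
    # Hypothetical thresholds in degrees C. A failed sensor that reads 0
    # falls below both lower thresholds.
    print(crossed_threshold_offsets(0, lower_nc=10, lower_crit=5,
                                    upper_nc=40, upper_crit=45))
```

A reading of 0 tripping both lower thresholds is how the lower thresholds act as a sensor-failure check, as described above.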
Table 35: Temperature Sensors Typical Characteristics
Byte
Field
Description
16
Event Data 3
Threshold value that triggered event
Event Trigger
Assertion
Severity
Deassert
Severity
Description
Hex
Description
00h
Lower non-critical
going low
Degraded
OK
The temperature has dropped below its lower non-critical threshold.
02h
Lower critical
going low
non-fatal
Degraded
The temperature has dropped below its lower critical threshold.
07h
Upper non-critical
going high
Degraded
OK
The temperature has gone over its upper non-critical threshold.
09h
Upper critical
going high
non-fatal
Degraded
The temperature has gone over its upper critical threshold.
Sensor
Number
Sensor Name
Next Steps
21h
Front Panel Temp
If the front panel temperature reads zero, check:
1. It is connected properly.
2. The SDR has been programmed correctly for your chassis.
If the front panel temperature is too high:
1. Check the cooling of your server room.
14h
Baseboard Temperature 5
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and the correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below
35°C).
15h
Baseboard Temperature 6
16h
I/O Mod2 Temp
17h
PCI Riser 5 Temp
18h
PCI Riser 4 Temp
20h
Baseboard Temperature 1
22h
SSB Temperature
Table 36: Temperature Sensors Event Triggers – Description
Table 37: Temperature Sensors – Next Steps
Sensor
Number
Sensor Name
Next Steps
23h
Baseboard Temperature 2
24h
Baseboard Temperature 3
25h
Baseboard Temperature 4
26h
I/O Mod Temp
27h
PCI Riser 1 Temp
28h
IO Riser Temp
2Ch
PCI Riser 2 Temp
2Dh
SAS Mod Temp
2Eh
Exit Air Temp
2Fh
LAN NIC Temp
Byte
Field
Description
11
Sensor Type
01h = Temperature
12
Sensor Number
See Table 40
5.2.2 Thermal Margin Sensors
Margin sensors are also linear sensors but typically report a negative value. This is not an actual temperature, but in fact an offset to
a critical temperature. Values reported are seen as number of degrees below a critical temperature for the particular component.
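A short worked example may make the sign convention clearer. The function name and the temperatures are hypothetical and chosen only for illustration; the sketch simply expresses a margin as the component temperature minus its critical temperature.

```python
def thermal_margin(temperature_c, critical_c):
    """Return the margin as an offset to the critical temperature.

    The result is negative while the component is below its critical
    temperature and reaches 0 at the critical temperature.
    """
    return temperature_c - critical_c


if __name__ == "__main__":
    margin = thermal_margin(70, 95)  # hypothetical values in degrees C
    print(margin)    # negative: still below the critical temperature
    print(-margin)   # number of degrees below the critical temperature
```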
The BMC supports DIMM aggregate temperature margin IPMI sensors. The temperature readings from the physical temperature
sensors on each DIMM (such as, Temperature Sensor on DIMM, or TSOD) are aggregated into IPMI temperature margin sensors for
groupings of DIMM slots, the partitioning of which is platform/SKU specific and generally corresponding to fan domains.
The BMC supports global aggregate temperature margin IPMI sensors. There may be as many unique global aggregate sensors as
there are fan domains. Each sensor aggregates the readings of multiple other IPMI temperature sensors supported by the BMC FW.
The mapping of child-sensors into each global aggregate sensor is SDR-configurable. The primary usage for these sensors is to
trigger turning off fans when a lower threshold is reached.
Sensor
Number
Sensor Name
Next Steps
B3h
P2 DIMM Thrm Mrgn2
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
B4h
P3 DIMM Thrm Mrgn1
B5h
P3 DIMM Thrm Mrgn2
B6h
P4 DIMM Thrm Mrgn1
B7h
P4 DIMM Thrm Mrgn2
C8h
Agg Therm Mrgn 1
C9h
Agg Therm Mrgn 2
CAh
Agg Therm Mrgn 3
CBh
Agg Therm Mrgn 4
CCh
Agg Therm Mrgn 5
CDh
Agg Therm Mrgn 6
CEh
Agg Therm Mrgn 7
CFh
Agg Therm Mrgn 8
Byte
Field
Description
11
Sensor Type
01h = Temperature
12
Sensor Number
78h = Processor 1 Thermal Control %
79h = Processor 2 Thermal Control %
5.2.3 Processor Thermal Control Sensors
The BMC FW monitors the percentage of time that a processor has been operationally constrained over a given time window
(nominally six seconds) due to internal thermal management algorithms engaging to reduce the temperature of the device. This
monitoring is instantiated as one IPMI analog/threshold sensor per processor package.
If this is not addressed, the processor will overheat and shut down the system to protect itself from damage.
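The percentage that this sensor reports can be pictured as follows. This is a sketch under stated assumptions, not the BMC firmware algorithm: it assumes the window is represented by evenly spaced boolean samples, where True means the processor was thermally constrained at that sample. The function name and the sampling model are hypothetical.

```python
def thermal_control_percent(constrained_samples):
    """Return the percentage of samples in which the processor was constrained.

    constrained_samples: sequence of booleans covering one time window
    (nominally six seconds), assumed to be evenly spaced.
    """
    if not constrained_samples:
        return 0.0
    constrained = sum(1 for sample in constrained_samples if sample)
    return 100.0 * constrained / len(constrained_samples)


if __name__ == "__main__":
    # 12 samples over the window, 3 of them constrained.
    print(thermal_control_percent([True] * 3 + [False] * 9))
```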
Table 41: Processor Thermal Control Sensors Typical Characteristics
Byte
Field
Description
7Ah = Processor 3 Thermal Control %
7Bh = Processor 4 Thermal Control %
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 01h (Threshold)
14
Event Data 1
[7:6] – 01b = Trigger reading in Event Data 2
[5:4] – 01b = Trigger threshold in Event Data 3
[3:0] – Event Triggers as described in Table 42
15
Event Data 2
Reading that triggered event
16
Event Data 3
Threshold value that triggered event
Table 42: Processor Thermal Control Sensors Event Triggers – Description
Hex  Event Trigger Description      Assertion Severity  Deassert Severity  Description
07h  Upper non-critical going high  Degraded            OK                 The thermal margin has gone over its upper non-critical threshold.
09h  Upper critical going high      Non-fatal           Degraded           The thermal margin has gone over its upper critical threshold.
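The byte layout in Table 41 and the trigger offsets in Table 42 can be decoded mechanically. The following is a minimal Python sketch, not part of any Intel tool: the function name, the tuple fields, and the sample bytes are illustrative assumptions, and the reading and threshold stay in raw sensor units because converting them requires the sensor's SDR.

```python
from typing import NamedTuple

# Event trigger offsets listed in Table 42; any other offset is reported as unlisted.
TABLE_42_TRIGGERS = {
    0x07: "Upper non-critical going high",
    0x09: "Upper critical going high",
}


class ThresholdEvent(NamedTuple):
    deassertion: bool   # byte 13, bit 7
    event_type: int     # byte 13, bits 6:0 (01h = Threshold)
    trigger: str        # Event Data 1, bits 3:0
    raw_reading: int    # Event Data 2 (raw sensor units)
    raw_threshold: int  # Event Data 3 (raw sensor units)


def decode_threshold_event(byte13: int, ed1: int, ed2: int, ed3: int) -> ThresholdEvent:
    """Split bytes 13 through 16 of a threshold SEL record into named fields."""
    offset = ed1 & 0x0F
    trigger = TABLE_42_TRIGGERS.get(
        offset, f"offset {offset:02X}h (not listed in Table 42)"
    )
    return ThresholdEvent(
        deassertion=bool(byte13 & 0x80),
        event_type=byte13 & 0x7F,
        trigger=trigger,
        raw_reading=ed2,
        raw_threshold=ed3,
    )


# Hypothetical record bytes: assertion, Threshold event type, offset 07h.
event = decode_threshold_event(0x01, 0x57, 0x64, 0x5A)
print(event)
```

Keeping the lookup table separate from the bit masking means the same function could serve other threshold sensors if their offsets were added to the dictionary.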
5.2.3.1 Processor Thermal Control % Sensors – Next Steps
These events normally occur due to failures of the thermal solution:
1. Verify heatsink is properly attached and has thermal grease.
2. If the system has a heatsink fan, ensure the fan is spinning.
3. Check all system fans are operating properly.
4. Check that the air used to cool the system is within limits (typically 35°C).
Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families incorporate a DTS-based thermal spec. This allows
much more accurate control of the thermal solution and enables lower fan speeds and lower fan power consumption. For Intel®
Xeon® processor E5-4600/2600/2400/1600 product families, this requires significant BMC FW calculations to derive the sensor value.
Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families are the follow-on processors to Intel® Xeon® processor
E5-4600/2600/2400/1600 product families. For Intel® Xeon® processor E5-4600/2600/2400/1600 v2 product families, the BMC’s
derivation of this value is greatly simplified because the majority of the calculations are performed within the processor itself.
The main usage of this sensor is as an input to the BMC’s fan control algorithms. The BMC implements this as a threshold sensor.
There is one DTS sensor for each installed physical processor package. Thresholds are not set and alert generation is not enabled
for these sensors.
Discrete thermal sensors do not report a temperature at all; instead, they report an overheating event of some kind, for example
VRD Hot (the voltage regulator is overheating) or processor Thermal Trip (the processor got so hot that its over-temperature protection
was triggered and the system was shut down to prevent damage).
Byte  Field                           Description
11    Sensor Type                     01h = Temperature
12    Sensor Number                   See Table 45
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = See Table 45
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 45
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Sensor Number  Sensor Name       Event Type  Offset (Hex)  Offset Description  Description
0Dh            SSB Thermal Trip  03h         01h           State Asserted      South Side Bridge (SSB) overheated
90h            P1 VRD Hot        05h         01h           Limit Exceeded      Processor 1 voltage regulator overheated
91h            P2 VRD Hot        05h         01h           Limit Exceeded      Processor 2 voltage regulator overheated
92h            P3 VRD Hot        05h         01h           Limit Exceeded      Processor 3 voltage regulator overheated
93h            P4 VRD Hot        05h         01h           Limit Exceeded      Processor 4 voltage regulator overheated
94h            P1 Mem01 VRD Hot  05h         01h           Limit Exceeded      Processor 1 Memory 0/1 voltage regulator overheated
95h            P1 Mem23 VRD Hot  05h         01h           Limit Exceeded      Processor 1 Memory 2/3 voltage regulator overheated
96h            P2 Mem01 VRD Hot  05h         01h           Limit Exceeded      Processor 2 Memory 0/1 voltage regulator overheated
Next Steps:
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used for cooling the system is within the thermal specifications for the system (typically below 35°C).
5.2.6 DIMM Thermal Trip Sensors
The BMC supports DIMM Thermal Trip monitoring that is instantiated as one aggregate IPMI discrete sensor per CPU. When a
DIMM Thermal Trip occurs, the system hardware will automatically power down the server and the BMC will assert the sensor offset
and log an event.
Byte  Field         Description
14    Event Data 1  [7:6] – 00b = Unspecified Event Data 2
                    [5:4] – 00b = Unspecified Event Data 3
                    [3:0] – Event Trigger Offset = 0Ah = Critical over temperature
15    Event Data 2  Not used
16    Event Data 3  Not used
5.2.6.1 DIMM Thermal Trip Sensors – Next Steps
1. Check for clear and unobstructed airflow into and out of the chassis.
2. Ensure the SDR is programmed and correct chassis has been selected.
3. Ensure there are no fan failures.
4. Ensure the air used to cool the system is within the thermal specifications for the system (typically below 35°C).
5.3 System Air Flow Monitoring Sensor
The BMC provides an IPMI sensor to report the volumetric system airflow in CFM (cubic feet per minute). The airflow in CFM is
calculated based on the system fan PWM values. The specific Pulse Width Modulation (PWM or PWMs) used to determine the CFM
is SDR-configurable. The relationship between PWM and CFM is based on a lookup table in an OEM SDR.
The airflow data is used in the calculation for exit air temperature monitoring. It is exposed as an IPMI sensor to allow a data center
management application to access this data for use in rack-level thermal management.
This sensor is informational only and will not log events into the SEL.
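The PWM-to-CFM relationship described above can be pictured as a table lookup. The sketch below is only an illustration: the table values are invented (the real points live in the OEM SDR), and linear interpolation between points is an assumption, since the guide does not say how the BMC handles PWM values that fall between table entries.

```python
from bisect import bisect_right

# Hypothetical (PWM duty cycle %, airflow CFM) points; real values come from the OEM SDR.
PWM_TO_CFM = [(0, 0.0), (25, 20.0), (50, 45.0), (75, 70.0), (100, 90.0)]


def airflow_cfm(pwm_percent: float, table=PWM_TO_CFM) -> float:
    """Estimate airflow for a PWM duty cycle by linear interpolation (an assumption)."""
    # Clamp to the range covered by the table.
    pwm = min(max(pwm_percent, table[0][0]), table[-1][0])
    pwm_points = [point[0] for point in table]
    i = bisect_right(pwm_points, pwm)
    if i >= len(table):
        return table[-1][1]  # at or beyond the last table entry
    (p0, c0), (p1, c1) = table[i - 1], table[i]
    return c0 + (c1 - c0) * (pwm - p0) / (p1 - p0)


print(airflow_cfm(60))
```

A rack-level management application would normally read the finished CFM value from the IPMI sensor; a calculation like this only matters when reasoning about how that value tracks fan speed.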
6. Processor Subsystem
Intel® servers report multiple processor-centric sensors in the SEL.
6.1 Processor Status Sensor
The BMC provides an IPMI sensor of type processor for monitoring status information for each processor slot. If an event state
(sensor offset) has been asserted, it remains asserted until one of the following happens:
- A Rearm Sensor Events command is executed for the processor status sensor.
- An AC or DC power cycle, system reset, or system boot occurs.
CPU Presence status is not saved across AC power cycles and therefore will not generate a deassertion after cycling AC power.
Table 47: Processor Status Sensors Typical Characteristics
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   70h = Processor 1 Status
                                      71h = Processor 2 Status
                                      72h = Processor 3 Status
                                      73h = Processor 4 Status
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 00b = Unspecified Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 48
15    Event Data 2                    Not used
16    Event Data 3                    Not used
Table 48: Processor Status Sensors – Next Steps
Offset  Processor Status                                                    Next Steps
0h      Internal error (IERR)                                               1. Cross test the processors.
                                                                            2. Replace the processors depending on the results of the test.
1h      Thermal trip                                                        This event normally only happens due to failures of the thermal solution:
                                                                            1. Verify heatsink is properly attached and has thermal grease.
                                                                            2. If the system has a heatsink fan, ensure the fan is spinning.
                                                                            3. Check all system fans are operating properly.
                                                                            4. Check that the air used to cool the system is within limits (typically 35°C).
2h      FRB1/BIST failure                                                   1. Cross test the processors.
                                                                            2. Replace the processors depending on the results of the test.
3h      FRB2/Hang in POST failure
4h      FRB3/Processor startup/initialization failure (CPU fails to start)
5h      Configuration error (for DMI)
6h      SM BIOS uncorrectable CPU-complex error
7h      Processor presence detected                                         Informational Event
8h      Processor disabled                                                  1. Cross test the processors.
                                                                            2. Replace the processors depending on the results of the test.
9h      Terminator presence detected
6.2 Catastrophic Error Sensor
When the Catastrophic Error signal (CATERR#) stays asserted, it is a sign that something serious has gone wrong in the hardware.
The BMC monitors this signal and reports when it stays asserted.
Byte  Field                           Description
11    Sensor Type                     07h = Processor
12    Sensor Number                   80h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 03h (Digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 1h (State Asserted)
15    Event Data 2                    Event Data 2 values as described in Table 50.
16    Event Data 3                    Bitmap of the CPU that causes the system CATERR.
                                          [0]: CPU1
                                          [1]: CPU2
                                          [2]: CPU3
                                          [3]: CPU4
                                      Note: If more than one bit is set, the BMC cannot
                                      determine the source of the CATERR.
ED2  Description  Next Steps
00h  Unknown      1. Cross test the processors.
                  2. Replace the processors depending on the results of the test.
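The Event Data 3 bitmap for the Catastrophic Error sensor can be read as follows. This is a minimal Python sketch; the function names and message wording are assumptions for illustration, and the handling of an all-zero bitmap is a guess because the guide does not describe that case.

```python
from typing import List


def caterr_cpus(ed3: int) -> List[str]:
    """Return the CPUs whose bits are set in Event Data 3 (bit 0 = CPU1 ... bit 3 = CPU4)."""
    return [f"CPU{bit + 1}" for bit in range(4) if ed3 & (1 << bit)]


def describe_caterr(ed3: int) -> str:
    cpus = caterr_cpus(ed3)
    if len(cpus) == 1:
        return f"CATERR source: {cpus[0]}"
    if not cpus:
        # Not covered by the guide; treated here as "source not identified".
        return "CATERR source not identified (no CPU bit set)"
    # More than one bit set: the BMC cannot determine the source.
    return "CATERR source undetermined (bits set for " + ", ".join(cpus) + ")"


print(describe_caterr(0x02))
print(describe_caterr(0x05))
```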
6.3 CPU Missing Sensor
The CPU Missing sensor is a discrete sensor that reports when a processor is not installed. The most common cause of this event is
a processor populated in the incorrect socket.
Table 51: CPU Missing Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   09h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 77h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset
6.3.1 CPU Missing Sensor – Next Steps
Verify the processor is installed in the correct slot.
6.4 Quick Path Interconnect Sensors
The Intel® Quick Path Interconnect (QPI) bus on Intel® EPSD Boards Based on Intel® Xeon® Processor
E5-4600/2600/2400/1600/1400 Product Families is the interconnect between processors.
The QPI Link Width Reduced sensor is used by the BIOS POST to report when the link width has been reduced. Therefore the
Generator ID will be 01h.
The QPI Error sensors are reported by the BIOS SMI Handler to the BMC so the Generator ID will be 33h.
6.4.1 QPI Link Width Reduced Sensor
BIOS POST has reduced the QPI Link Width because of an error condition seen during initialization.
Table 52: QPI Link Width Reduced Sensor Typical Characteristics
Byte  Field                           Description
                                      1h = Reduced to ½ width
                                      2h = Reduced to ¼ width
15    Event Data 2                    0-3 = CPU1-4
16    Event Data 3                    Not used
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     13h = Critical Interrupt
12    Sensor Number                   06h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 72h (OEM Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 00b = Unspecified Event Data 3
                                      [3:0] – Event Trigger Offset = Reserved
6.4.1.1 QPI Link Width Reduced Sensor – Next Steps
If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor. If the issue remains with the processor socket, replace the main board; otherwise, replace the processor.
6.4.2 QPI Correctable Error Sensor
The system detected an error and corrected it. This is an informational event.
Byte  Field         Description
                    0h = Illegal inbound request
                    1h = IIO Write Cache Uncorrectable Data ECC Error
                    2h = IIO CSR crossing 32-bit boundary Error
                    3h = IIO Received XPF physical/logical redirect interrupt inbound
                    4h = IIO Illegal SAD or Illegal or non-existent address or memory
                    5h = IIO Write Cache Coherency Violation
15    Event Data 2  0-3 = CPU1-4
16    Event Data 3  Not used
6.4.3.1 QPI Fatal Error and Fatal Error #2 – Next Steps
This is an Informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Check the processor is installed correctly.
2. Inspect the socket for bent pins.
3. Cross test the processor. If the issue remains with the processor socket, replace the main board; otherwise, replace the processor.
6.5 Processor ERR2 Timeout Sensor
The BMC supports an ERR2 Timeout Sensor (one per CPU) that asserts if a CPU’s ERR2 signal has been asserted for longer than a
fixed time period (> 90 seconds). ERR[2] is a processor signal that indicates when the IIO (Integrated IO module in the processor)
has a fatal error which could not be communicated to the core to trigger SMI. ERR[2] events are fatal error conditions, where the
BIOS and OS will attempt to handle the error gracefully, but may not always do so reliably. A continuously asserted ERR2 signal is an
indication that the BIOS cannot service the condition that caused the error. This is usually because that condition prevents the BIOS
from running.
When an ERR2 timeout occurs, the BMC asserts/deasserts the ERR2 Timeout Sensor, and logs a SEL event for that sensor. The
default behavior for BMC core firmware is to initiate a system reset upon detection of an ERR2 timeout. The BIOS setup utility
provides an option to disable or enable system reset by the BMC on detection of this condition.
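The timeout rule described above (signal asserted continuously for more than 90 seconds) can be sketched as a small state tracker. This is a hypothetical Python illustration of the rule only, not the BMC firmware's implementation; the class name, the sampling interface, and the choice to report the timeout once per continuous assertion are all assumptions.

```python
ERR2_TIMEOUT_SECONDS = 90.0  # "> 90 seconds" per the description above


class Err2TimeoutTracker:
    """Tracks one CPU's ERR2 signal and flags a timeout on sustained assertion."""

    def __init__(self) -> None:
        self.asserted_since = None  # time at which the current assertion began
        self.timed_out = False

    def sample(self, asserted: bool, now: float) -> bool:
        """Feed one signal sample; return True only when a timeout is first detected."""
        if not asserted:
            # Signal cleared: forget the assertion and allow a future timeout.
            self.asserted_since = None
            self.timed_out = False
            return False
        if self.asserted_since is None:
            self.asserted_since = now
        if not self.timed_out and now - self.asserted_since > ERR2_TIMEOUT_SECONDS:
            self.timed_out = True
            return True
        return False


tracker = Err2TimeoutTracker()
for t in (0.0, 45.0, 90.0, 91.0):
    print(t, tracker.sample(True, t))
```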
1. Check the SEL for any other events around the time of the failure.
2. Take note of all IPMI activity that was occurring around the time of the failure. Capture a System BMC Debug Log as soon as you
can after experiencing this failure. This log can be captured from the Integrated BMC Web Console or by using the Intel® Syscfg
utility (syscfg /sbmcdl private filename.zip). Send the log file to your system manufacturer or Intel representative for failure
analysis.
6.6 Processor MSID Mismatch Sensor
The BMC supports an MSID Mismatch sensor that monitors for the fault condition that occurs if there is a power rating
incompatibility between a baseboard and a processor.
The sensor is rearmed on power-on (AC or DC power on transitions).
Verify the processor is supported by your baseboard. Check your board's Technical Product Specification (TPS).
7. Memory Subsystem
Intel® servers report memory errors, status, and configuration in the SEL.
7.1 Memory RAS Configuration Status
A Memory RAS Configuration Status event is logged after an AC power-on occurs, only if any RAS Mode is currently configured, and
only if RAS Mode is successfully initiated.
This is to make sure that there is a record in the SEL telling what the RAS Mode was at the time that the system started up. This is
only logged after AC power-on, not DC power-on.
The Memory RAS Configuration Status Sensor is also used to log an event during POST whenever there is a RAS configuration
error. This is a case where a RAS Mode has been selected but when the system boots, the memory configuration cannot support the
RAS Mode. The memory configuration fails, and operates in Independent Channel Mode.
In the SEL record logged, the ED1 Offset value is “RAS Configuration Disabled”, and ED3 contains the RAS Mode that is currently
selected but could not be configured. ED2 gives the reason for the RAS configuration failure. At present, only two “RAS Configuration
Error Type” values are implemented:
0 = None – This is used for an AC power-on log record when the RAS configuration is successfully configured.
3 = Invalid DIMM Configuration for RAS Mode – The installed DIMM configuration cannot support the currently selected RAS
Mode. This may be due to DIMMs that have failed or been disabled, so when this reason has been logged, the user
should check the preceding SEL events to see whether there are DIMM error events.
Table 58: Memory RAS Configuration Status Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 09h (Digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 59
Table 59: Memory RAS Configuration Status Sensor – Event Trigger Offset – Next Steps
Mirrored channel mode is disabled (either in setup or due to unavailability of memory at POST, in which case POST error 8500 is
also logged).
1. If this event is accompanied by a POST error 8500, there was a problem applying the mirroring configuration to the memory.
Check for other errors related to the memory and troubleshoot accordingly.
2. If there is no POST error, mirror mode was simply disabled in BIOS setup and this should be considered informational only.
7.2 Memory RAS Mode Select
Memory RAS Mode Select events are logged to record changes in RAS Mode.
When a RAS Mode selection is made that changes the RAS Mode (including selecting a RAS Mode from or to Independent Channel
Mode), that change is logged to SEL in a Memory RAS Mode Select event message, which records the previous RAS Mode (from)
and the newly selected RAS Mode (to). The event also includes an Offset value in ED1 which indicates whether the mode change
left the system with a RAS Mode active (Enabled), or not (Disabled – Independent Channel Mode selected). This sensor provides the
Spare Channel mode RAS Configuration status. Memory RAS Mode Select is an informational event.
Byte  Field                           Description
8, 9  Generator ID                    0001h = BIOS POST
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   12h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 09h (Digital Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
7.3 Mirroring Redundancy State
Mirroring Mode protects memory data by full redundancy – keeping complete copies of all data on both channels of a Mirroring
Domain (channel pair). If an Uncorrectable Error, which is normally fatal, occurs on one channel of a pair, and the other channel is
still intact and operational, then the Uncorrectable Error is “demoted” to a Correctable Error, and the failed channel is disabled.
Because the Mirror Domain is no longer redundant, a Mirroring Redundancy State SEL Event is logged.
Table 61: Mirroring Redundancy State Sensor Typical Characteristics
Byte  Field         Description
14    Event Data 1  [7:6] – 10b = OEM code in Event Data 2
                    [5:4] – 10b = OEM code in Event Data 3
                    [3:0] – Event Trigger Offset
                        0h = Fully Redundant
                        2h = Redundancy Degraded
15    Event Data 2  Location
                    [7:4] = Mirroring Domain
                        0-1 = Channel Pair for Socket
                    [3:2] = Reserved
                    [1:0] = Rank on DIMM
                        0-3 = Rank Number
16    Event Data 3  Location
                    [7:5] = Socket ID
                        0-3 = CPU1-4
                    [4:3] = Channel
                        0-3 = Channel A-D for Socket
                    [2:0] = DIMM
                        0-2 = DIMM 1-3 on Channel
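The Event Data 2 and Event Data 3 location fields of the Mirroring Redundancy State event can be unpacked with a few shifts and masks. The following Python sketch is an illustration only; the function and field names are assumptions, and it does not reject values outside the documented ranges.

```python
from typing import NamedTuple


class MirrorLocation(NamedTuple):
    domain: int   # ED2 [7:4], Mirroring Domain (channel pair for the socket)
    rank: int     # ED2 [1:0], rank number on the DIMM
    cpu: int      # ED3 [7:5] + 1, so 0-3 becomes CPU1-4
    channel: str  # ED3 [4:3], 0-3 becomes channel A-D
    dimm: int     # ED3 [2:0] + 1, so 0-2 becomes DIMM 1-3


def decode_mirror_location(ed2: int, ed3: int) -> MirrorLocation:
    """Unpack the location fields from Event Data 2 and Event Data 3."""
    return MirrorLocation(
        domain=(ed2 >> 4) & 0x0F,
        rank=ed2 & 0x03,
        cpu=((ed3 >> 5) & 0x07) + 1,
        channel="ABCD"[(ed3 >> 3) & 0x03],
        dimm=(ed3 & 0x07) + 1,
    )


# Hypothetical event data bytes taken from the hex view of a SEL record.
loc = decode_mirror_location(0x12, 0x29)
print(f"CPU{loc.cpu} channel {loc.channel} DIMM {loc.dimm}, domain {loc.domain}, rank {loc.rank}")
```

Matching the decoded CPU, channel, and DIMM against the board's slot labels (or the DIMM Fault LED, where present) identifies the module to inspect.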
7.3.1 Mirroring Redundancy State Sensor – Next Steps
This event is accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected
DIMM).
For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error triggering the
Mirroring Failover action, that is, the failing DIMM.
7.4 Sparing Redundancy State
Rank Sparing Mode is a Memory RAS configuration option that reserves one memory rank per channel as a “spare rank”. If any rank
on a given channel experiences enough Correctable ECC Errors to cross the Correctable Error Threshold, the data in that rank is
copied to the spare rank, and then the spare rank is mapped into the memory array to replace the failing rank.
Rank Sparing Mode protects memory data by reserving a “Spare Rank” on each channel that has memory installed on it. If a
Correctable Error Threshold event occurs, the data from the failing rank is copied to the Spare Rank on the same channel, and the
failing DIMM is disabled. Because the Sparing Domain is no longer redundant, a Sparing Redundancy State SEL Event is logged.
Table 62: Sparing Redundancy State Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   11h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 0Bh (Generic Discrete)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset
                                          0h = Fully Redundant
                                          2h = Redundancy Degraded
15    Event Data 2                    Location
                                      [7:4] = Sparing Domain
                                          0-3 = Channel A-D for Socket
                                      [3:2] = Reserved
                                      [1:0] = Rank on DIMM
                                          0-3 = Rank Number
16    Event Data 3                    Location
                                      [7:5] = Socket ID
                                          0-3 = CPU1-4
                                      [4:3] = Channel
                                          0-3 = Channel A-D for Socket
                                      [2:0] = DIMM
                                          0-2 = DIMM 1-3 on Channel
7.4.1 Sparing Redundancy State Sensor – Next Steps
This event is accompanied by memory errors indicating the source of the issue. Troubleshoot accordingly (probably replace affected
DIMM).
For boards with DIMM Fault LEDs, the appropriate Fault LED is lit to indicate which DIMM was the source of the error triggering the
Mirroring Failover action, that is, the failing DIMM.
7.5 ECC and Address Parity
1. Memory data errors are logged as correctable or uncorrectable.
2. Uncorrectable errors are fatal.
3. Memory addresses are protected with parity bits and a parity error is logged. This is a fatal error.
7.5.1 Memory Correctable and Uncorrectable ECC Error
ECC errors are divided into Uncorrectable ECC Errors and Correctable ECC Errors. A “Correctable ECC Error” actually represents a
threshold overflow: more Correctable Errors than the threshold allows have been detected at the memory controller level for a given
DIMM within a given timeframe.
In both cases, the error can be narrowed down to particular DIMM(s). The BIOS SMI error handler uses this information to log the
data to the BMC SEL and identify the failing DIMM module.
Table 63: Correctable and Uncorrectable ECC Error Sensor Typical Characteristics
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   02h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset as described in Table 64
15    Event Data 2                    [7:2] – Reserved. Set to 0.
                                      [1:0] – Rank on DIMM
                                          0-3 = Rank number
16    Event Data 3                    [7:5] – Socket ID
                                          0-3 = CPU1-4
                                      [4:3] – Channel
                                          0-3 = Chan A-D for Socket
                                      [2:0] – DIMM
                                          0-2 = DIMM 1-3 on Channel
Table 64: Correctable and Uncorrectable ECC Error Sensor Event Trigger Offset – Next Steps
Event Trigger Offset: 01h (Uncorrectable ECC Error)
Description:
An uncorrectable (multi-bit) ECC error has occurred. This is a fatal issue that will typically lead to an OS crash
(unless memory has been configured in a RAS mode). The system will generate a CATERR# (catastrophic error)
and an MCE (Machine Check Exception Error).
While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or
improper contact between socket and DIMM, or by bent pins in the processor socket.
Next Steps:
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
Event Trigger Offset: 00h (Correctable ECC Error threshold reached)
Description:
There have been too many (10 or more) correctable ECC errors for this particular DIMM since last boot. This event
in itself does not pose any direct problems because the ECC errors are still being corrected. Depending on the
RAS configuration of the memory, the IMC may take the affected DIMM offline.
Next Steps:
Even though this event doesn't immediately lead to problems, it can indicate one of the DIMM modules is slowly failing. If this
error occurs more than once:
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
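The correctable-error threshold described in Table 64 (10 or more correctable ECC errors for one DIMM since last boot) amounts to a per-DIMM counter. The Python sketch below only illustrates that counting idea; the class name, the DIMM key format, and the choice to report the threshold once per DIMM are assumptions, and the real counting is done by the memory controller and BIOS SMI handler.

```python
from collections import Counter

CORRECTABLE_ECC_THRESHOLD = 10  # "10 or more" per Table 64


class CorrectableEccCounter:
    """Counts correctable ECC errors per DIMM since 'boot' (object creation)."""

    def __init__(self, threshold: int = CORRECTABLE_ECC_THRESHOLD) -> None:
        self.threshold = threshold
        self.counts = Counter()

    def record(self, dimm) -> bool:
        """Record one correctable error; return True when the DIMM first reaches the threshold."""
        self.counts[dimm] += 1
        return self.counts[dimm] == self.threshold


counter = CorrectableEccCounter()
dimm = ("CPU1", "A", 1)  # hypothetical key: socket, channel, DIMM
results = [counter.record(dimm) for _ in range(10)]
print(results[-1], counter.counts[dimm])
```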
7.5.2 Memory Address Parity Error
Address Parity errors are errors detected in the memory addressing hardware. Because these affect the addressing of memory
contents, they can potentially lead to the same sort of failures as ECC errors. They are logged as a distinct type of error because
they affect memory addressing rather than memory contents, but otherwise they are treated exactly the same as Uncorrectable ECC
Errors. Address Parity errors are logged to the BMC SEL, with Event Data to identify the failing address by channel and DIMM to the
extent that it is possible to do so.
Byte  Field                           Description
8, 9  Generator ID                    0033h = BIOS SMI Handler
11    Sensor Type                     0Ch = Memory
12    Sensor Number                   13h
13    Event Direction and Event Type  [7] Event direction
                                          0b = Assertion Event
                                          1b = Deassertion Event
                                      [6:0] Event Type = 6Fh (Sensor Specific)
14    Event Data 1                    [7:6] – 10b = OEM code in Event Data 2
                                      [5:4] – 10b = OEM code in Event Data 3
                                      [3:0] – Event Trigger Offset = 2h
15    Event Data 2                    [7:5] – Reserved. Set to 0.
                                      [4] – Channel Information Validity Check:
                                          0b = Channel Number in Event Data 3 Bits[4:3] is not valid
                                          1b = Channel Number in Event Data 3 Bits[4:3] is valid
                                      [3] – DIMM Information Validity Check:
                                          0b = DIMM Slot ID in Event Data 3 Bits[2:0] is not valid
                                          1b = DIMM Slot ID in Event Data 3 Bits[2:0] is valid
                                      [2:0] – Error Type:
                                          000b = Parity Error Type not known
                                          001b = Data Parity Error (not used)
                                          010b = Address Parity Error
                                          All other values are reserved.
16    Event Data 3                    [7:5] – Indicates the Processor Socket to which the DDR3 DIMM having the ECC error is attached:
                                          0-3 = CPU1-4
                                          All other values are reserved.
                                      [4:3] – Channel Number (if valid) on which the Parity Error occurred. This value will be
                                      indeterminate and should be ignored if ED2 Bit [4] is 0b.
                                          00b = Channel A
                                          01b = Channel B
                                          10b = Channel C
                                          11b = Channel D
                                      [2:0] – DIMM Slot ID (if valid) of the specific DIMM that was involved in the transaction that
                                      led to the parity error. This value will be indeterminate and should be ignored if ED2 Bit [3] is 0b.
                                          000b = DIMM Socket 1
                                          001b = DIMM Socket 2
                                          010b = DIMM Socket 3
                                          All other values are reserved.
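Because the channel and DIMM fields of a Memory Address Parity Error are only meaningful when the matching validity bits in Event Data 2 are set, a decoder should check those bits first. The Python sketch below shows one way to do that; the names are illustrative assumptions, and fields flagged as not valid are returned as None rather than decoded.

```python
from typing import NamedTuple, Optional

PARITY_ERROR_TYPES = {
    0b000: "Parity Error Type not known",
    0b001: "Data Parity Error (not used)",
    0b010: "Address Parity Error",
}


class ParityErrorInfo(NamedTuple):
    error_type: str
    cpu: int                # ED3 [7:5] + 1
    channel: Optional[str]  # None when ED2 bit 4 says the channel is not valid
    dimm: Optional[int]     # None when ED2 bit 3 says the DIMM slot is not valid


def decode_parity_error(ed2: int, ed3: int) -> ParityErrorInfo:
    """Decode Event Data 2/3 of a parity error event, honoring the validity bits."""
    channel_valid = bool(ed2 & 0x10)
    dimm_valid = bool(ed2 & 0x08)
    return ParityErrorInfo(
        error_type=PARITY_ERROR_TYPES.get(ed2 & 0x07, "Reserved"),
        cpu=((ed3 >> 5) & 0x07) + 1,
        channel="ABCD"[(ed3 >> 3) & 0x03] if channel_valid else None,
        dimm=(ed3 & 0x07) + 1 if dimm_valid else None,
    )


# Hypothetical bytes: channel and DIMM both valid, Address Parity Error.
info = decode_parity_error(0x1A, 0x4A)
print(info)
```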
7.5.2.1 Memory Address Parity Error Sensor – Next Steps
These are bit errors that are detected in the memory addressing hardware. An Address Parity Error implies that the memory address
transmitted to the DIMM addressing circuitry has been compromised, and data read or written is compromised in turn. An Address
Parity Error is logged as such in SEL but in all other ways is treated the same as an Uncorrectable ECC Error.
While the error may be due to a failing DRAM chip on the DIMM, it can also be caused by incorrect seating or improper contact
between the socket and DIMM, or by bent pins in the processor socket.
1. If needed, decode DIMM location from hex version of SEL.
2. Verify the DIMM is seated properly.
3. Examine gold fingers on edge of the DIMM to verify contacts are clean.
4. Inspect the processor socket this DIMM is connected to for bent pins, and if found, replace the board.
5. Consider replacing the DIMM as a preventative measure. For multiple occurrences, replace the DIMM.
Byte
Field
Description
8 9 Generator ID
0033h = BIOS SMI Handler
11
Sensor Type
13h = Critical Interrupt
12
Sensor Number
03h
13
Event Direction and
Event Type
[7] Event direction
0b = Assertion Event
1b = Deassertion Event
[6:0] Event Type = 6Fh (Sensor Specific)
14
Event Data 1
[7:6] – 10b = OEM code in Event Data 2
[5:4] – 10b = OEM code in Event Data 3
[3:0] – Event Trigger Offset
PCI Express* and Legacy PCI Subsystem
8. PCI Express* and Legacy PCI Subsystem
The PCI Express* (PCIe) Specification defines standard error types under the Advanced Error Reporting (AER) capabilities. The
BIOS logs AER events into the SEL.
The Legacy PCI Specification error types are PERR and SERR. These errors are supported and logged into the SEL.
8.1 PCI Express* Errors
PCIe error events are either correctable (informational) or fatal. In both cases, information is logged to help identify the source
of the PCIe error, and the bus, device, and function numbers are included in the extended data fields. The operating system maps
each PCIe device by bus, device, and function, and this combination uniquely identifies the device. PCIe device
information can be found in the operating system.
8.1.1 Legacy PCI Errors
Legacy PCI errors include PERR and SERR; both are fatal errors.
[7:3] – PCI Device number
[2:0] – PCI Function number
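The device and function bit fields above can be unpacked as in the following sketch. It assumes the caller has already obtained the bus number and the packed device/function byte from the SEL entry; the function names are hypothetical, and the `bb:dd.f` output format is simply the notation used by tools such as `lspci` on Linux, chosen here to make it easier to match the result against the operating system's device list.

```python
from typing import Tuple


def decode_device_function(packed: int) -> Tuple[int, int]:
    """Return (device, function) from a byte laid out as [7:3] device, [2:0] function."""
    device = (packed >> 3) & 0x1F
    function = packed & 0x07
    return device, function


def format_bdf(bus: int, packed: int) -> str:
    """Format bus, device, and function as bb:dd.f (hexadecimal)."""
    device, function = decode_device_function(packed)
    return f"{bus:02x}:{device:02x}.{function}"


# Illustrative values only: bus 05h, packed byte 1Ah (device 3, function 2).
print(format_bdf(0x05, 0x1A))
```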
Byte   Field                Description
8, 9   Generator ID         0033h = BIOS SMI Handler
11     Sensor Type          13h = Critical Interrupt
12     Sensor Number        04h
13     Event Direction and  [7] Event direction
       Event Type               0b = Assertion Event
                                1b = Deassertion Event
                            [6:0] Event Type = 70h (OEM Specific)
8.1.1.1 Legacy PCI Error Sensor – Next Steps
1. Decode the bus, device, and function to identify the card.
2. If this is an add-in card:
a. Verify the card is inserted properly.
b. Install the card in another slot and check whether the error follows the card or stays with the slot.
c. Update all firmware and drivers, including non-Intel components.
3. If this is an on-board device:
a. Update all BIOS, firmware, and drivers.
b. Replace the board.
8.1.2 PCI Express* Fatal Errors and Fatal Error #2
When a PCI Express* fatal error is reported to the BIOS SMI handler, the handler records the error using the following format.
[7:3] – PCI Device number
[2:0] – PCI Function number
8.1.2.1 PCI Express* Fatal Error and Fatal Error #2 Sensor – Next Steps
1. Decode the bus, device, and function to identify the card.
2. If this is an add-in card:
a. Verify the card is inserted properly.
b. Install the card in another slot and check whether the error follows the card or stays with the slot.
c. Update all firmware and drivers, including non-Intel components.
3. If this is an on-board device:
a. Update all BIOS, firmware, and drivers.
b. Replace the board.
8.1.3 PCI Express* Correctable Errors
When a PCI Express* correctable error is reported to the BIOS SMI handler, the handler records the error using the following format.
Byte   Field                Description
8, 9   Generator ID         0033h = BIOS SMI Handler
11     Sensor Type          13h = Critical Interrupt
12     Sensor Number        05h
13     Event Direction and  [7] Event direction
       Event Type               0b = Assertion Event
                                1b = Deassertion Event
                            [6:0] Event Type = 71h (OEM Specific)
14     Event Data 1         [7:6] – 10b = OEM code in Event Data 2
                            [5:4] – 10b = OEM code in Event Data 3
                            [3:0] – Event Trigger Offset
8.1.3.1 PCI Express* Correctable Error Sensor – Next Steps
This is an informational event only. Correctable errors are acceptable and normal at a low rate of occurrence. If the error continues:
1. Decode the bus, device, and function to identify the card.
2. If this is an add-in card:
a. Verify the card is inserted properly.
b. Install the card in another slot and check whether the error follows the card or stays with the slot.
c. Update all firmware and drivers, including non-Intel components.
3. If this is an on-board device:
a. Update all BIOS, firmware, and drivers.
b. Replace the board.
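One way to decide whether a correctable error "continues" is to count events per device and flag only the devices that exceed a limit. The sketch below is an assumption-laden illustration: this guide does not give a numeric rate, so `threshold` is a hypothetical value chosen by the caller, and the input is assumed to be a list of already-decoded bus/device/function strings, one per logged correctable error.

```python
from collections import Counter
from typing import Iterable, List


def devices_over_threshold(events: Iterable[str], threshold: int) -> List[str]:
    """Return the devices that logged more than `threshold` correctable errors.

    `events` holds one bus:device.function string per logged event.
    `threshold` is a caller-chosen limit; the guide does not define one.
    """
    counts = Counter(events)
    return sorted(bdf for bdf, n in counts.items() if n > threshold)


# Illustrative data only.
log = ["05:03.2", "05:03.2", "00:1c.0", "05:03.2"]
print(devices_over_threshold(log, threshold=2))
```

Devices returned by the sketch are candidates for the steps above; devices below the limit can be treated as normal background.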
9. System BIOS Events
There are a number of events that are owned by the system BIOS. These events can occur during Power On Self Test (POST) or
when coming out of a sleep state. Not all of these events signify errors. Some events are described in other chapters in this
document (for example, memory events).
9.1 System Events
These events can occur during POST or when coming out of a sleep state. These are informational events only.
1. Events logged during BIOS POST use Generator ID 0001h.
2. Events logged by the BIOS SMI Handler use Generator ID 0033h.
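The two Generator IDs above can be told apart with a small lookup, sketched below. The sketch assumes that SEL byte 8 holds the low byte and byte 9 the high byte of the Generator ID; this byte order is an assumption and is not stated in this section. The names used are illustrative.

```python
# Generator IDs named in this section.
GENERATOR_IDS = {
    0x0001: "BIOS POST",
    0x0033: "BIOS SMI Handler",
}


def generator_name(byte8: int, byte9: int) -> str:
    """Map SEL bytes 8 and 9 to a generator name.

    Assumption: byte 8 is the low byte and byte 9 is the high byte.
    """
    generator_id = (byte9 << 8) | byte8
    return GENERATOR_IDS.get(generator_id, f"unknown ({generator_id:04X}h)")


print(generator_name(0x33, 0x00))
```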
9.1.1 System Boot
At the end of POST, just before the OS boot occurs, a System Boot event is logged. It marks the
transition of control from the completed POST to the OS loader. It is an informational event only.
9.1.2 Timestamp Clock Synchronization
These events are logged when the BIOS synchronizes the BMC time with its own. Two events are logged. The BIOS logs the
first event when it sends the time synch message to the BMC. The timestamp of that entry is unreliable: it is the "before" timestamp,
taken before the BMC clock is corrected, so it can be any value.
The BIOS therefore sends a second time synch message to put a correct "baseline" timestamp in the log. That is the "starting time".
For example, say that the BMC time is March 1, 2011 21:00 and the BIOS time synch updates it to 21:20 on the same date (the
BMC was running behind). Without the second time synch message, nothing in the log shows that the log time jumped ahead, and the
next log message makes it look as if there was an unexplained 20-minute delay during the boot.
Without the second time synch message, the time span to the next logged message is indeterminate. With the second time synch as
a baseline, the log timestamps that follow are always determinate.
Event Trigger Offset:
    01h = System Boot
    05h = Timestamp Clock Synchronization
Byte   Field                Description
15     Event Data 2         For Event Trigger Offset 05h only (Timestamp Clock Synchronization):
                                00h = 1st in pair
                                80h = 2nd in pair
16     Event Data 3         Not used
The timestamp clock synchronization runs, and its events are logged by the BIOS POST, every time the system boots. In addition,
during shutdown from some operating systems, the BIOS SMI Handler is called to run the timestamp clock synchronization and log
the events.
Table 70: System Event Sensor Typical Characteristics
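The role of the second event in the pair can be illustrated with a short sketch that measures elapsed time only from the most recent "2nd in pair" entry. The input layout is a hypothetical simplification: each entry is a `(timestamp, marker)` tuple, where `marker` is the Event Data 2 value (00h or 80h) for a Timestamp Clock Synchronization event and `None` for any other event.

```python
from datetime import datetime
from typing import List, Optional, Tuple

FIRST_IN_PAIR = 0x00   # Event Data 2 value for the first synch event
SECOND_IN_PAIR = 0x80  # Event Data 2 value for the second synch event

Entry = Tuple[datetime, Optional[int]]


def elapsed_since_baseline(entries: List[Entry]) -> List[Optional[float]]:
    """Return seconds since the latest 2nd-in-pair event for each entry.

    Entries with no valid baseline (including every 1st-in-pair event,
    whose timestamp predates the clock correction) yield None.
    """
    baseline = None
    result = []
    for timestamp, marker in entries:
        if marker == FIRST_IN_PAIR:
            baseline = None          # clock is about to change; old baseline is stale
        elif marker == SECOND_IN_PAIR:
            baseline = timestamp     # correct "starting time"
        if baseline is None:
            result.append(None)
        else:
            result.append((timestamp - baseline).total_seconds())
    return result


# The example from the text: BMC at 21:00, corrected to 21:20, next event one minute later.
log = [
    (datetime(2011, 3, 1, 21, 0), FIRST_IN_PAIR),
    (datetime(2011, 3, 1, 21, 20), SECOND_IN_PAIR),
    (datetime(2011, 3, 1, 21, 21), None),
]
print(elapsed_since_baseline(log))
```

With the baseline in place, the third entry is seen to be one minute after the synchronization rather than 21 minutes after the first entry.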
Byte   Field                Description
8, 9   Generator ID         0001h = BIOS POST
11     Sensor Type          0Fh = System Firmware Progress (formerly POST Error)
12     Sensor Number        06h
13     Event Direction and  [7] Event direction
       Event Type               0b = Assertion Event
                                1b = Deassertion Event
                            [6:0] Event Type = 6Fh (Sensor Specific)
14     Event Data 1         [7:6] – 10b = OEM code in Event Data 2
                            [5:4] – 10b = OEM code in Event Data 3
                            [3:0] – Event Trigger Offset = 0h
15     Event Data 2         Low Byte of POST Error Code
16     Event Data 3         High Byte of POST Error Code
9.2 System Firmware Progress (Formerly POST Error)
The BIOS logs any POST errors to the SEL. The 2-byte POST error code is logged in the ED2 and ED3 bytes of the SEL entry. This
event is logged every time a POST error is displayed. Even though this event indicates an error, the error may not be fatal. If it
is a serious error, there is typically also a corresponding SEL entry for whatever caused the error; that entry
may contain more information about what happened than the POST error event.
Table 71: POST Error Sensor Typical Characteristics
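Because Event Data 2 carries the low byte and Event Data 3 the high byte of the POST error code, the two bytes can be recombined as in the following sketch. The function names are illustrative, and the sketch assumes that the four-digit codes in the POST error code table are hexadecimal.

```python
def post_error_code(ed2: int, ed3: int) -> int:
    """Combine ED2 (low byte) and ED3 (high byte) into the 2-byte POST error code."""
    return ((ed3 & 0xFF) << 8) | (ed2 & 0xFF)


def format_post_error_code(ed2: int, ed3: int) -> str:
    """Return the code as four hexadecimal digits for lookup in the error code table."""
    return f"{post_error_code(ed2, ed3):04X}"


# Illustrative values: ED2 = 30h, ED3 = 81h.
print(format_post_error_code(0x30, 0x81))
```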
9.2.1 System Firmware Progress (Formerly POST Error) – Next Steps
See the following table for POST Error Codes.
Error Code  Error Message                                                  Response
0012        System RTC date/time not set                                   Major
0048        Password check failed                                          Major
0140        PCI component encountered a PERR error                         Major
0141        PCI resource conflict                                          Major
0146        PCI out of resources error                                     Major
0191        Processor core/thread count mismatch detected                  Fatal
0192        Processor cache size mismatch detected                         Fatal
0194        Processor family mismatch detected                             Fatal
0195        Processor Intel(R) QPI link frequencies unable to synchronize  Fatal
0196        Processor model mismatch detected                              Fatal
0197        Processor frequencies unable to synchronize                    Fatal
5220        BIOS Settings reset to default settings                        Major
5221        Passwords cleared by jumper                                    Major
5224        Password clear jumper is Set                                   Major
8130        Processor 01 disabled                                          Major
8131        Processor 02 disabled                                          Major
8132        Processor 03 disabled                                          Major
8133        Processor 04 disabled                                          Major
8160        Processor 01 unable to apply microcode update                  Major
8161        Processor 02 unable to apply microcode update                  Major
8162        Processor 03 unable to apply microcode update                  Major
8163        Processor 04 unable to apply microcode update                  Major
8170        Processor 01 failed Self Test (BIST)                           Major
8171        Processor 02 failed Self Test (BIST)                           Major
8172        Processor 03 failed Self Test (BIST)                           Major
8173        Processor 04 failed Self Test (BIST)                           Major
8180        Processor 01 microcode update not found                        Minor
8181        Processor 02 microcode update not found                        Minor
8182        Processor 03 microcode update not found                        Minor
8183        Processor 04 microcode update not found                        Minor
Table 72: POST Error Codes
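In the rows shown above, the processor-specific codes (813x, 816x, 817x, and 818x) differ only in their last digit, which runs from 0 to 3 for Processor 01 to Processor 04. The sketch below uses that pattern to describe such a code. The pattern is read off the listed rows and is not stated as a rule by this guide, so the sketch rejects anything outside the listed range; all names are illustrative.

```python
from typing import Optional, Tuple

# Base code -> (message suffix, response), taken from the rows listed above.
PROCESSOR_CODE_FAMILIES = {
    0x8130: ("disabled", "Major"),
    0x8160: ("unable to apply microcode update", "Major"),
    0x8170: ("failed Self Test (BIST)", "Major"),
    0x8180: ("microcode update not found", "Minor"),
}


def describe_processor_code(code: int) -> Optional[Tuple[str, str]]:
    """Return (message, response) for a listed per-processor code, else None."""
    base = code & 0xFFF0
    index = code & 0x000F
    # Only last digits 0 through 3 appear in the listed rows.
    if base not in PROCESSOR_CODE_FAMILIES or index > 3:
        return None
    suffix, response = PROCESSOR_CODE_FAMILIES[base]
    return f"Processor {index + 1:02d} {suffix}", response


print(describe_processor_code(0x8172))
```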