System Event Log Troubleshooting
Guide for EPSD Platforms Based on
Intel® Xeon® Processor E5
4600/2600/2400/1600/1400
Product Families
Intel order number G90620-002
Revision 1.1
September 2013
Enterprise Platforms and Services Division – Marketing
Revision History
Date             Revision Number   Modifications
January 2013     1.0               Initial release
September 2013   1.1               Added MIC Thermal Margin sensors C4 through C7.
                                   Added MIC Status sensors A2, A3, A6, and A7.
                                   Added voltage sensors EA, EB, EC, ED, and EF.
                                   Corrected typographical errors.
                                   Made corrections to Firmware Update Status table.
                                   Made corrections to Catastrophic Error Sensor table.
                                   Added support for S1400FP, S1400SP, S1600JP, and S4600LH.
Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS
GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR
SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly,
in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH,
HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS'
FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL
INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR
NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF
THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not
rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves
these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them. The information here is subject to change without notice. Do not finalize a design with this
information.
The products described in this document may contain design defects or errors known as errata which may cause the
product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may
be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.
Table of Contents
5.1 Fan Sensors
5.1.1 Fan Tachometer Sensors
5.1.2 Fan Presence and Redundancy Sensors
5.2 Temperature Sensors
5.2.1 Threshold-based Temperature Sensors
1. Introduction
The server management hardware that is part of the Intel® Server Boards and Intel® Server
Platforms serves as a vital part of the overall server management strategy. The server
management hardware provides essential information to the system administrator and provides
the administrator the ability to remotely control the server, even when the operating system is
not running.
The Intel® Server Boards and Intel® Server Platforms offer comprehensive hardware and
software based solutions. The server management features make the servers simple to manage
and provide alerting on system events. From entry to enterprise systems, good overall server
management is essential to reduce overall total cost of ownership.
This Troubleshooting Guide is intended to help users better understand the events that are
logged in the Baseboard Management Controller (BMC) System Event Log (SEL) on these
Intel® Server Boards.
There is a separate User’s Guide that covers the general server management and the server
management software offered on the Intel® Server Boards and Intel® Server Platforms.
Server boards currently supported by this document:
Intel® S1400FP Server Boards
Intel® S1400SP Server Boards
Intel® S1600JP Server Boards
Intel® S2400BB Server Boards
Intel® S2400EP Server Boards
Intel® S2400GP Server Boards
Intel® S2400LP Server Boards
Intel® S2400SC Server Boards
Intel® S2600CO Server Boards
Intel® S2600CP Server Boards
Intel® S2600GZ/S2600GL Server Boards
Intel® S2600IP Server Boards
Intel® S2600JF Server Boards
Intel® S2600WP Server Boards
Intel® S4600LH Server Boards
Intel® W2600CR Workstation Boards
1.1 Purpose
The purpose of this document is to list all possible events generated by the Intel platform.
Other sources outside Intel's control may also generate events; those events are not
described in this document.
The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the
inventory, monitoring, logging, and recovery control functions are available independently of the
main processors, BIOS, and operating system. Platform management functions can also be
made available when the system is in a power-down state.
IPMI works by interfacing with the BMC, which extends management capabilities in the server
system and operates independently of the main processor by monitoring the on-board
instrumentation. Through the BMC, IPMI also allows administrators to control power to the
server, and remotely access BIOS configuration and operating system console information.
IPMI defines a common platform instrumentation interface to enable interoperability between:
The baseboard management controller and chassis
The baseboard management controller and systems management software
Between servers
IPMI enables the following:
Common access to platform management information, consisting of:
- Local access from systems management software
- Remote access from LAN
- Inter-chassis access from Intelligent Chassis Management Bus
- Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the
processor is down
The IPMI interface isolates systems management software from hardware.
Hardware advancements can be made without impacting the systems management software.
You can find more information on IPMI at the following URL:
http://www.intel.com/design/servers/ipmi
1.2.2 Baseboard Management Controller (BMC)
A baseboard management controller (BMC) is a specialized microcontroller embedded on most
Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the
intelligence behind intelligent platform management, that is, the autonomous monitoring and
recovery features implemented directly in platform management hardware and firmware.
Different types of sensors built into the computer system report to the BMC on parameters such
as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC
monitors the system for critical events by communicating with various sensors on the system
board; it sends alerts and logs events when certain parameters exceed their preset thresholds,
indicating a potential failure of the system. The administrator can also remotely communicate
with the BMC to take some corrective action such as resetting or power cycling the system to
get a hung OS running again. These abilities save on the total cost of ownership of a system.
For Intel® Server Boards and Intel® Server Platforms, the BMC supports the industry standard
IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.
1.2.2.1 System Event Log (SEL)
The BMC provides a centralized, non-volatile repository for critical, warning, and informational
system events called the System Event Log or SEL. Having the BMC manage the SEL and
logging functions helps to ensure that “post-mortem” logging information is available if a
failure occurs that disables the system processor(s).
The BMC allows access to the SEL through in-band and out-of-band mechanisms. Various
tools and utilities can be used to access the SEL, including the Intel® SELView utility and
multiple open source IPMI tools.
1.2.3 Intel® Intelligent Power Node Manager Version 2.0
Intel® Intelligent Power Node Manager Version 2.0 (NM) is a platform-resident technology that
enforces power and thermal policies for the platform. These policies are applied by exploiting
subsystem knobs (such as processor P and T states) that can be used to control power
consumption. Intel® Intelligent Power Node Manager enables data center power and thermal
management by exposing an external interface to management software through which platform
policies can be specified. It also enables specific data center power management usage models
such as power limiting.
The configuration and control commands are used by the external management software or
BMC to configure and control the Intel® Intelligent Power Node Manager feature. Because
Platform Services firmware does not have any external interface, external commands are first
received by the BMC over LAN and then relayed to the Platform Services firmware over IPMB
channel. The BMC acts as a relay and the transport conversion device for these commands. For
simplicity, the commands from the management console might be encapsulated in a generic
CONFIG packet format (configuration data length, configuration data blob) to the BMC so that
the BMC doesn’t even have to parse the actual configuration data.
The BMC provides the access point for remote commands from external management software and
generates alerts to that software. Intel® Intelligent Power Node Manager on Intel® Manageability Engine
(Intel® ME) is an IPMI satellite controller. A mechanism exists to forward commands to Intel® ME
and send the response back to the originator. Similarly, events from Intel® ME are sent as
alerts outside of the BMC.
2. Basic Decoding of a SEL Record
The System Event Log (SEL) record format is defined in the IPMI Specification. The following section provides a basic definition for
each of the fields in a SEL record. For more details, see the IPMI Specification.
The definitions for the standard SEL record can be found in Table 1.
The definitions for the OEM-defined event logs can be found in Table 3 and Table 4.
2.1 Default Values in the SEL Records
Unless otherwise noted in the event record descriptions, the following are the default values in all SEL entries.
Byte [3] = Record Type (RT) = 02h = System event record
Byte [9:8] = Generator ID = 0020h = BMC Firmware
Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0
Table 1. SEL Record Format
Byte   Field                         Description
1-2    Record ID (RID)               ID used for SEL Record access.
3      Record Type (RT)              [7:0] – Record Type
                                       02h = System event record
                                       C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
                                       E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)
4-7    Timestamp (TS)                Time when event was logged. LS byte first.
                                     Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
                                     Note: There are various websites that will convert the raw number to a date/time.
8-9    Generator ID (GID)            RqSA and LUN if event was generated from IPMB.
                                     Software ID if event was generated from system software.
                                     Byte 1
                                     [7:1] – 7-bit I2C Slave Address, or 7-bit system software ID
                                     [0] – 0b = ID is IPMB Slave Address
                                           1b = System software ID
                                     Software ID values:
                                       0001h – BIOS POST for POST errors, RAS Configuration/State,
                                               Timestamp Synch, OS Boot events
                                       0033h – BIOS SMI Handler
                                       0020h – BMC Firmware
                                       002Ch – ME Firmware
                                       0041h – Server Management Software
                                       00C0h – HSC Firmware – HSBP A
                                       00C2h – HSC Firmware – HSBP B
                                     Byte 2
                                     [7:4] – Channel number. Channel that event message was received over. 0h if the event
                                             message was received from the system interface, primary IPMB, or internally generated
                                             by the BMC.
                                     [3:2] – Reserved. Write as 00b.
                                     [1:0] – IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.
10     Event Message Revision (ER)   Event Message format revision. 04h = IPMI 2.0 (see section 2.1).
11     Sensor Type (ST)              Sensor Type Code for sensor that generated the event
12     Sensor # (SN)                 Number of sensor that generated the event (From SDR)
13     Event Dir | Event Type        Event Dir
       (EDIR)                        [7] – 0b = Assertion event.
                                           1b = Deassertion event.
                                     Event Type
                                     [6:0] – Type of trigger for the event, for example, critical threshold going high, state asserted,
                                             and so on. Also indicates class of the event. For example, discrete, threshold, or OEM.
                                             The Event Type field is encoded using the Event/Reading Type Code.
14-16  Event Data 1-3                See Table 2.
       (ED1, ED2, ED3)
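The field layout in Table 1 can be illustrated with a short Python sketch. This is a minimal, unofficial example, not an Intel utility: the function name `parse_sel_record`, the `GENERATORS` dictionary, and the returned key names are hypothetical. It assumes the record is available as 16 raw bytes in the byte order shown in Table 1, and it looks up the generator by the first Generator ID byte only.

```python
import struct
from datetime import datetime, timezone

# First Generator ID byte, copied from the Software ID list in Table 1.
GENERATORS = {
    0x01: "BIOS POST",
    0x33: "BIOS SMI Handler",
    0x20: "BMC Firmware",
    0x2C: "ME Firmware",
    0x41: "Server Management Software",
    0xC0: "HSC Firmware - HSBP A",
    0xC2: "HSC Firmware - HSBP B",
}


def parse_sel_record(raw: bytes) -> dict:
    """Split a 16-byte system event record (Record Type 02h) into named fields."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    # "<" selects little-endian with no padding: multi-byte fields are LS byte first.
    rid, rt, ts, gid, er, st, sn, edir, ed1, ed2, ed3 = struct.unpack(
        "<HBIHBBBBBBB", raw
    )
    if rt != 0x02:
        raise ValueError(f"record type {rt:02X}h is not a system event record")
    return {
        "record_id": rid,
        "timestamp_raw": ts,
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc),
        "generator_id": gid,
        "generator": GENERATORS.get(gid & 0xFF, "unknown"),
        "channel": (gid >> 12) & 0x0F,    # Generator ID byte 2, bits [7:4]
        "event_msg_rev": er,
        "sensor_type": st,
        "sensor_number": sn,
        "assertion": (edir & 0x80) == 0,  # bit 7: 0b = assertion event
        "event_type": edir & 0x7F,        # bits [6:0]
        "event_data": (ed1, ed2, ed3),
    }


# Field values from the first record in section 2.2.1. ED3 is not listed
# there, so FFh (unspecified) is assumed for illustration.
record = parse_sel_record(bytes.fromhex("1F0002C3708D4F33000412836F0500FF"))
print(record["generator"], record["timestamp"].isoformat())
```

Keeping the raw timestamp next to the converted value makes it easy to compare the result against the hex dump of the log.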
Table 2: Event Request Message Event Data Field Contents
Sensor Class   Event Data
Threshold      Event Data 1
               [7:6] – 00b = Unspecified Event Data 2
                       01b = Trigger reading in Event Data 2
                       10b = OEM code in Event Data 2
                       11b = Sensor-specific event extension code in Event Data 2
               [5:4] – 00b = Unspecified Event Data 3
                       01b = Trigger threshold value in Event Data 3
                       10b = OEM code in Event Data 3
                       11b = Sensor-specific event extension code in Event Data 3
               [3:0] – Offset from Event/Reading Code for threshold event.
               Event Data 2 – Reading that triggered event, FFh or not present if unspecified.
               Event Data 3 – Threshold value that triggered event, FFh or not present if unspecified. If present, Event Data 2 must be present.
Discrete       Event Data 1
               [7:6] – 00b = Unspecified Event Data 2
                       01b = Previous state and/or severity in Event Data 2
                       10b = OEM code in Event Data 2
                       11b = Sensor-specific event extension code in Event Data 2
               [5:4] – 00b = Unspecified Event Data 3
                       01b = Reserved
                       10b = OEM code in Event Data 3
                       11b = Sensor-specific event extension code in Event Data 3
               [3:0] – Offset from Event/Reading Code for discrete event state
               Event Data 2
               [7:4] – Optional offset from “Severity” Event/Reading Code (0Fh if unspecified).
               [3:0] – Optional offset from Event/Reading Type Code for previous discrete event state (0Fh if unspecified).
               Event Data 3 – Optional OEM code. FFh or not present if unspecified.
OEM            Event Data 1
               [7:6] – 00b = Unspecified in Event Data 2
                       01b = Previous state and/or severity in Event Data 2
                       10b = OEM code in Event Data 2
                       11b = Reserved
               [5:4] – 00b = Unspecified Event Data 3
                       01b = Reserved
                       10b = OEM code in Event Data 3
                       11b = Reserved
               [3:0] – Offset from Event/Reading Type Code
               Event Data 2
               [7:4] – Optional OEM code bits or offset from “Severity” Event/Reading Type Code (0Fh if unspecified).
               [3:0] – Optional OEM code or offset from Event/Reading Type Code for previous event state (0Fh if unspecified).
               Event Data 3 – Optional OEM code. FFh or not present if unspecified.
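The bit fields of Event Data 1 in Table 2 can be unpacked with a few shifts and masks. The following Python sketch is illustrative only: the function name `decode_event_data1` and the lookup tables are hypothetical, and the sensor class is assumed to be known already (for example, from the Event Type in byte 13).

```python
# Meaning of Event Data 1 bits [7:6] (contents of Event Data 2), per Table 2.
ED2_CONTENTS = {
    "threshold": ("unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"),
    "oem": ("unspecified", "previous state and/or severity", "OEM code",
            "reserved"),
}

# Meaning of Event Data 1 bits [5:4] (contents of Event Data 3), per Table 2.
ED3_CONTENTS = {
    "threshold": ("unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"),
    "oem": ("unspecified", "reserved", "OEM code", "reserved"),
}


def decode_event_data1(ed1: int, sensor_class: str) -> dict:
    """Decode Event Data 1 for a 'threshold', 'discrete', or 'oem' sensor class."""
    if sensor_class not in ED2_CONTENTS:
        raise ValueError(f"unknown sensor class: {sensor_class!r}")
    return {
        "event_data2": ED2_CONTENTS[sensor_class][(ed1 >> 6) & 0x03],
        "event_data3": ED3_CONTENTS[sensor_class][(ed1 >> 4) & 0x03],
        "offset": ed1 & 0x0F,
    }


# ED1 = A0h with an OEM event type: both ED2 and ED3 carry OEM codes, offset 0h.
print(decode_event_data1(0xA0, "oem"))
```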
Table 3: OEM SEL Record (Type C0h-DFh)
Byte    Field                Description
1-2     Record ID (RID)      ID used for SEL Record access.
3       Record Type (RT)     [7:0] – Record Type
                               C0h-DFh = OEM timestamped, bytes 8-16 OEM defined
4-7     Timestamp (TS)       Time when event was logged. LS byte first.
                             Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
                             Note: There are various websites that will convert the raw number to a date/time.
8-10    Manufacturer ID      LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA
                             “Private Enterprise” ID.
                             Most significant four bits = Reserved (0000b).
                             000000h = Unspecified. 0FFFFFh = Reserved.
                             This value is binary encoded.
                             For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which will be
                             stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.
11-16   OEM Defined          OEM Defined. This is defined according to the manufacturer identified by the
                             Manufacturer ID field.
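The Manufacturer ID byte order described in Table 3 can be checked with a small Python sketch. The function names `decode_manufacturer_id` and `parse_oem_timestamped_record` are hypothetical, and the sample record bytes other than the manufacturer ID and timestamp are made up for illustration.

```python
import struct


def decode_manufacturer_id(byte8: int, byte9: int, byte10: int) -> int:
    """Combine bytes 8-10 (LS byte first) into the 20-bit manufacturer ID."""
    value = byte8 | (byte9 << 8) | (byte10 << 16)
    return value & 0x0FFFFF  # the most significant four bits are reserved


def parse_oem_timestamped_record(raw: bytes) -> dict:
    """Split a 16-byte OEM timestamped record (Record Type C0h-DFh)."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    # Record ID (2 bytes), Record Type (1 byte), Timestamp (4 bytes), LS byte first.
    rid, rt, ts = struct.unpack("<HBI", raw[:7])
    if not 0xC0 <= rt <= 0xDF:
        raise ValueError(f"record type {rt:02X}h is not OEM timestamped")
    return {
        "record_id": rid,
        "record_type": rt,
        "timestamp_raw": ts,
        "manufacturer_id": decode_manufacturer_id(raw[7], raw[8], raw[9]),
        "oem_data": raw[10:16],  # bytes 11-16, meaning defined by the manufacturer
    }


# The example from Table 3: F2h, 1Bh, 00h is 1BF2h, which is 7154 decimal.
print(decode_manufacturer_id(0xF2, 0x1B, 0x00))
```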
Table 4: OEM SEL Record (Type E0h-FFh)
Byte    Field                Description
1-2     Record ID (RID)      ID used for SEL Record access.
3       Record Type (RT)     [7:0] – Record Type
                               E0h-FFh = OEM system event record
4-16    OEM                  OEM Defined. This is defined by the system integrator.
2.2 Notes on SEL Logs and Collecting SEL Information
Whenever you capture the SEL, always collect both the text (human-readable) version and the hex version. Because
some of the data is OEM-specific, some utilities cannot decode the information correctly. In addition, some OEM-specific data
carries additional variables that are not decoded at all.
One example of incomplete decoding is the BIOS timestamp synchronization event. This event can be logged by the
BIOS during POST, or by the BIOS SMI Handler when the operating system (OS) requests a shutdown or a restart.
See section 2.2.1 for examples. Most utilities report this as just a BIOS event and do not differentiate
between the two. The distinction is useful because it shows the sequence of events more clearly. For example, if there are multiple
sequences of timestamp synchronization events, was power lost after booting to the OS and the system then restarted, were there
multiple POST events, or was it a restart from the OS?
Other examples of incomplete decoding are the PCI Express* errors and some of the Power Supply events. For the PCI
Express* errors, the type of error and the PCI Bus, Device, and Function are all encoded in Event Data 1 through Event Data 3. See
section 2.2.2. For the Power Supply events, when there is a failure, predictive failure, or configuration error, Event Data 2 and Event
Data 3 hold additional information that describes the power supply's PMBus* Command Registers and values for that particular
event. See section 2.2.3.
2.2.1 Examples of Decoding BIOS Timestamp Events
The following are some samples of BIOS timestamp events during POST and during an OS shutdown.
RID (Record ID) = 001Fh
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C3h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
[6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair
RID (Record ID) = 0020h
RT (Record Type) = 02h = system event record
TS (Timestamp) = 4F8D70C4h
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 12h = System Event (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 83h
EDIR (Event Direction/Event Type) = 6Fh; [7] = 0 = Assertion Event
[6:0] = 6Fh = Sensor specific
ED1 (Event Data 1) = 05h = Timestamp Clock Synchronization
ED2 (Event Data 2) = 00h = First in pair
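The distinction between POST and SMI Handler timestamp events can be sketched in Python as follows. This is an illustrative helper, not part of any Intel tool: the name `describe_timestamp_sync` is hypothetical. The Generator ID values come from Table 1. The interpretation of Event Data 2 bit 7 as "second in pair" when set is an assumption taken from the IPMI Specification and is not shown in the records above.

```python
from typing import Optional, Tuple

# Generator ID values from Table 1 that can log this event, with the cause
# described in section 2.2.
SOURCES = {
    0x0001: "BIOS POST",
    0x0033: "BIOS SMI Handler (OS-requested shutdown or restart)",
}


def describe_timestamp_sync(gid: int, sensor_type: int,
                            ed1: int, ed2: int) -> Optional[Tuple[str, str]]:
    """Return (source, position in pair), or None if not a timestamp sync event."""
    # Sensor Type 12h = System Event; offset 05h = Timestamp Clock Synchronization.
    if sensor_type != 0x12 or (ed1 & 0x0F) != 0x05:
        return None
    source = SOURCES.get(gid, f"unknown generator {gid:04X}h")
    # Assumption (IPMI Specification): ED2 bit 7 is 0b for first, 1b for second.
    position = "second in pair" if ed2 & 0x80 else "first in pair"
    return source, position


# First record above: GID = 0033h, ST = 12h, ED1 = 05h, ED2 = 00h.
print(describe_timestamp_sync(0x0033, 0x12, 0x05, 0x00))
```

Applying a helper like this to each record in sequence helps show whether a run of synchronization events came from repeated POSTs or from OS-initiated restarts.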
2.2.2 Example of Decoding a PCI Express* Correctable Error Event
The following is an example of decoding a PCI Express* correctable error event. For this particular event it recorded a receiver error
on Bus 0, Device 2, and Function 2. Note that correctable errors are acceptable and normal at a low rate of occurrence.
RT (Record Type) = 02h = system event record
TS (Timestamp) = 502E9B0Ah
GID (Generator ID) = 0033h = BIOS SMI Handler
ER (Event Message Revision) = 04h = IPMI v2.0
ST (Sensor Type) = 13h = Critical Interrupt (From IPMI Specification Table 42-3, Sensor Type Codes)
SN (Sensor Number) = 05h
EDIR (Event Direction/Event Type) = 71h; [7] = 0 = Assertion Event
[6:0] = 71h = OEM Specific for PCI Express* correctable errors
ED1 (Event Data 1) = A0h; [7:6] = 10b = OEM code in Event Data 2
[5:4] = 10b = OEM code in Event Data 3
[3:0] = Event Trigger Offset = 0h = Receiver Error
ED2 (Event Data 2) = 00h; PCI Bus number = 0
ED3 (Event Data 3) = 12h; [7:3] = PCI Device number = 02h
[2:0] = PCI Function number = 2
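The Bus, Device, and Function split shown in this example can be reproduced with a short Python sketch. The function name `decode_pcie_location` is hypothetical. The bit positions are the ones given in the decode above, and they are assumed to apply only to this PCI Express* correctable error event.

```python
from typing import Tuple


def decode_pcie_location(ed2: int, ed3: int) -> Tuple[int, int, int]:
    """Return (bus, device, function) from Event Data 2 and Event Data 3."""
    bus = ed2                     # Event Data 2 holds the PCI Bus number
    device = (ed3 >> 3) & 0x1F    # Event Data 3 bits [7:3]
    function = ed3 & 0x07         # Event Data 3 bits [2:0]
    return bus, device, function


# ED2 = 00h, ED3 = 12h from the example above.
bus, device, function = decode_pcie_location(0x00, 0x12)
print(f"Bus {bus}, Device {device}, Function {function}")
```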
2.2.3 Example of Decoding a Power Supply Predictive Failure Event
The following is an example of decoding a Power Supply predictive failure event. In this example, power supply 1 saw an A/C power
loss event with both the input under-voltage warning and fault events set. In most cases this means that the A/C power dipped
below the minimum warning and fault thresholds for over 20 milliseconds but the system remained powered on. If these events
continue to occur, it is advisable to check your power source.