System Event Log Troubleshooting
Guide for Intel® S5500/S3420 series
Server Boards
Intel order number G74211-001
Revision 1.0
August 2012
Enterprise Platforms and Services Division – Marketing
Disclaimers
Information in this document is provided in connection with Intel® products. No license, express or implied, by
estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel®’s
Terms and Conditions of Sale for such products, Intel® assumes no liability whatsoever, and Intel® disclaims any
express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to
fitness for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property
right. Intel® products are not intended for use in medical, lifesaving, or life sustaining applications. Intel® may make
changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or
“undefined.” Intel® reserves these for future definition and shall have no responsibility whatsoever for conflicts or
incompatibilities arising from future changes to them.
This document contains information on products in the design phase of development. Do not finalize a design with
this information. Revised information will be published when the product is available. Verify with your local sales office
that you have the latest datasheet before finalizing a design.
The product may contain design defects or errors known as errata which may cause the product to deviate from the
published specifications. Current characterized errata are available on request.
This document and the software described in it are furnished under license and may only be used or copied in
accordance with the terms of the license. The information in this manual is furnished for informational use only, is
subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel
Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or
any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means without the express written consent of Intel Corporation.
Intel, Pentium, Itanium, and Xeon are trademarks or registered trademarks of Intel Corporation.
*Other brands and names may be claimed as the property of others.
1. Introduction
The server management hardware that is part of Intel® server boards and Intel® server platforms
serves as a vital part of the overall server management strategy. The server management
hardware provides essential information to the system administrator and provides the
administrator the ability to remotely control the server, even when the operating system is not
running.
The Intel® server boards and Intel® server platforms offer comprehensive hardware and
software based solutions. The server management features make the servers simple to manage
and provide alerting on system events. From entry to enterprise systems, good overall server
management is essential to reducing overall total cost of ownership.
This Troubleshooting Guide is intended to help the users better understand the events that are
logged in the Baseboard Management Controllers (BMC) System Event Logs (SEL) on these
Intel® server boards.
A separate User’s Guide covers general server management and the server
management software offered on Intel® server boards and Intel® server platforms.
Server boards currently supported by this document:
Intel® S3200/X38ML server boards
Intel® S5500/S3420 series server boards.
1.1 Purpose
The purpose of this document is to list all possible events generated by the Intel® platform. It
may be possible that other sources (not under our control) also generate events, which will not
be described in this document.
The key characteristic of the Intelligent Platform Management Interface (IPMI) is that the
inventory, monitoring, logging, and recovery control functions are available independent of the
main processors, BIOS, and operating system. Platform management functions can also be
made available when the system is in a powered down state.
IPMI works by interfacing with the BMC, which extends management capabilities in the server
system and operates independent of the main processor by monitoring the on-board
instrumentation. Through the BMC, IPMI also allows administrators to control power to the
server, and remotely access BIOS configuration and operating system console information.
IPMI defines a common platform instrumentation interface to enable interoperability between:
The baseboard management controller and chassis
The baseboard management controller and systems management software
Between servers
IPMI enables the following:
Common access to platform management information, consisting of:
- Local access from systems management software
- Remote access from LAN
- Inter-chassis access from Intelligent Chassis Management Bus
- Access from LAN, serial/modem, IPMB, PCI SMBus*, or ICMB, available even if the
processor is down
The IPMI interface isolates systems management software from hardware.
Hardware advancements can be made without impacting the systems management software.
You can find more information on IPMI at the following URL:
http://www.intel.com/design/servers/ipmi
1.2.2 Baseboard Management Controller (BMC)
A baseboard management controller (BMC) is a specialized microcontroller embedded on most
Intel® Server Boards. The BMC is the heart of the IPMI architecture and provides the
intelligence behind intelligent platform management, that is, the autonomous monitoring and
recovery features implemented directly in platform management hardware and firmware.
Different types of sensors built into the computer system report to the BMC on parameters such
as temperature, cooling fan speeds, power mode, operating system status, and so on. The BMC
monitors the system for critical events by communicating with various sensors on the system
board; it sends alerts and logs events when certain parameters exceed their preset thresholds,
indicating a potential failure of the system. The administrator can also remotely communicate
with the BMC to take some corrective action such as resetting or power cycling the system to
get a hung OS running again. These abilities save on the total cost of ownership of a system.
For Intel® server boards and Intel® Server platforms, the BMC supports the industry-standard
IPMI 2.0 Specification, enabling you to configure, monitor, and recover systems remotely.
1.2.2.1 System Event Log (SEL)
The BMC provides a centralized, non-volatile repository for critical, warning, and informational
system events called the System Event Log or SEL. Having the BMC manage the SEL and
logging functions helps to ensure that ‘post-mortem’ logging information is available should a
failure occur that disables the system’s processor(s).
The BMC allows access to the SEL through both in-band and out-of-band mechanisms. Various
tools and utilities can be used to access the SEL, including the Intel® SELViewer and
several open-source IPMI tools.
1.2.3 Intel® Intelligent Power Node Manager version 1.5
Intel® Intelligent Power Node Manager version 1.5 (NM) is a platform resident technology that
enforces power and thermal policies for the platform. These policies are applied by exploiting
subsystem knobs (such as processor P and T states) that can be used to control power
consumption. Intel® Intelligent Power Node Manager enables data center power and thermal
management by exposing an external interface to management software through which platform
policies can be specified. It also enables specific data center power management usage models
such as power limiting.
The configuration and control commands are used by the external management software or
BMC to configure and control the Intel® Intelligent Power Node Manager feature. Since Platform
Services firmware does not have any external interface, external commands are first received
by the BMC over LAN and then relayed to the Platform Services firmware over IPMB channel.
The BMC acts as a relay and the transport conversion device for these commands. For
simplicity, the commands from the management console might be encapsulated in a generic
CONFIG packet format (config data length, config data blob) to the BMC so that the BMC
does not even have to parse the actual configuration data.
The BMC provides the access point for remote commands from external management software and
generates alerts to that software. Intel® Intelligent Power Node Manager on Intel® Manageability Engine
(Intel® ME) is an IPMI satellite controller. A mechanism must exist to forward commands to
Intel® ME and send the response back to the originator. Similarly, events from Intel® ME have to be sent
as alerts outside of the BMC. It is the responsibility of the BMC to implement these mechanisms for
communication with Intel® Intelligent Power Node Manager.
The full specification can be downloaded from the following link:
2. Basic decoding of a SEL Record
The System Event Log (SEL) record format is defined in the IPMI Specification. The
following section provides a basic definition for each of the fields in a SEL record. For more details
see the IPMI Specification.
The definitions for the standard SEL can be found in Table 1.
The definitions for the OEM defined event logs can be found in Table 3 and Table 4.
2.1 Default values in the SEL records
Unless otherwise noted in the event record descriptions, the following are the default values
in all SEL entries.
Byte [3] = Record Type (RT) = 02h = system event record
Byte [9:8] = Generator ID = 0020h = BMC Firmware
Byte [10] = Event Message Revision (ER) = 04h = IPMI 2.0
Table 1: SEL Record Format
Bytes 1-2: Record ID (RID)
    ID used for SEL Record access.
Byte 3: Record Type (RT)
    [7:0] - Record Type
        02h = system event record
        C0h-DFh = OEM timestamped, bytes 8-16 OEM defined (See Table 3)
        E0h-FFh = OEM non-timestamped, bytes 4-16 OEM defined (See Table 4)
Bytes 4-7: Timestamp (TS)
    Time when event was logged. LS byte first.
    Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
    Note: There are various websites that will convert the raw number to a date/time.
Bytes 8-9: Generator ID (GID)
    RqSA and LUN if event was generated from IPMB.
    Software ID if event was generated from system software.
    Byte 1
    [7:1] - 7-bit I2C Slave Address, or 7-bit system software ID
    [0] - 0b = ID is IPMB Slave Address
          1b = system software ID
    Software ID values:
        0001h – BIOS POST for POST errors, RAS Configuration/State,
                Timestamp Synch, OS Boot events
        0033h – BIOS SMI Handler
        0020h – BMC Firmware
        002Ch – ME Firmware
        0041h – Server Management Software
        00C0h – HSC Firmware – HSBP A
        00C2h – HSC Firmware – HSBP B
    Byte 2
    [7:4] - Channel number. Channel that event message was received over. 0h if the event
            message was received from the system interface, primary IPMB, or internally generated
            by the BMC.
    [3:2] - reserved. Write as 00b.
    [1:0] - IPMB device LUN if byte 1 holds Slave Address. 00b otherwise.
Byte 10: Event Message Revision (ER)
    04h = IPMI 2.0
Byte 11: Sensor Type
    Sensor Type Code for sensor that generated the event
Byte 12: Sensor # (SN)
    Number of sensor that generated the event (From SDR)
Byte 13: Event Dir | Event Type (EDIR)
    Event Dir
    [7] - 0b = Assertion event.
          1b = Deassertion event.
    Event Type
    Type of trigger for the event, for example, critical threshold going high, state asserted,
    and so on. Also indicates class of the event. For example, discrete, threshold, or OEM.
    The Event Type field is encoded using the Event/Reading Type Code.
    [6:0] - Event Type Codes
Byte 14: Event Data 1 (ED1)
Byte 15: Event Data 2 (ED2)
Byte 16: Event Data 3 (ED3)
    Per Table 2: Event Request Message Event Data Field Contents
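The field layout in Table 1 can be sketched in code. The following Python example is a minimal illustration, not an Intel tool: the function names `parse_system_event_record` and `decode_generator_id` are hypothetical, the sample record bytes are made up for illustration (only the timestamp bytes come from the Timestamp example), and it assumes the Record ID is stored least significant byte first like the timestamp.

```python
import struct
from datetime import datetime, timezone


def decode_generator_id(byte1, byte2):
    """Split the Generator ID (record bytes 8 and 9) into its bit fields."""
    return {
        "is_software_id": bool(byte1 & 0x01),  # [0]: 1b = system software ID
        "id_7bit": byte1 >> 1,                 # [7:1]: slave address or software ID
        "channel": byte2 >> 4,                 # [7:4]: channel number
        "lun": byte2 & 0x03,                   # [1:0]: IPMB device LUN
    }


def parse_system_event_record(raw):
    """Decode a 16-byte SEL record of type 02h into a dictionary."""
    if len(raw) != 16:
        raise ValueError("a SEL record is exactly 16 bytes")
    # Bytes 1-2 Record ID, byte 3 Record Type, bytes 4-7 timestamp.
    # "<" selects little-endian, matching "LS byte first".
    record_id, record_type, seconds = struct.unpack_from("<HBI", raw, 0)
    if record_type != 0x02:
        raise ValueError("not a system event record (type 02h)")
    (gid1, gid2, evm_rev, sensor_type, sensor_number,
     dir_type, ed1, ed2, ed3) = raw[7:16]
    return {
        "record_id": record_id,
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "generator": decode_generator_id(gid1, gid2),
        "event_message_revision": evm_rev,
        "sensor_type": sensor_type,
        "sensor_number": sensor_number,
        "is_deassertion": bool(dir_type & 0x80),  # byte 13, bit [7]
        "event_type": dir_type & 0x7F,            # byte 13, bits [6:0]
        "event_data": (ed1, ed2, ed3),
    }


# Hypothetical record: BMC (GID 0020h), voltage sensor 10h, threshold assertion.
sample = bytes.fromhex("01 00 02 29 76 68 4C 20 00 04 02 10 01 52 80 85")
record = parse_system_event_record(sample)
print(record["timestamp"].isoformat())  # 2010-08-15T23:20:09+00:00
```

Note that the 7-bit ID is the upper seven bits of the first Generator ID byte, so the BMC value 20h decodes to a 7-bit slave address of 10h with the software ID flag cleared.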
Table 2: Event Request Message Event Data Field Contents
Sensor Class: Threshold
    Event Data 1
    [7:6] - 00b = unspecified Event Data 2
            01b = trigger reading in Event Data 2
            10b = OEM code in Event Data 2
            11b = sensor-specific event extension code in Event Data 2
    [5:4] - 00b = unspecified Event Data 3
            01b = trigger threshold value in Event Data 3
            10b = OEM code in Event Data 3
            11b = sensor-specific event extension code in Event Data 3
    [3:0] - Offset from Event/Reading Code for threshold event.
    Event Data 2 – reading that triggered event, FFh or not present if unspecified.
    Event Data 3 – threshold value that triggered event, FFh or not present if unspecified. If present,
    Event Data 2 must be present.
Sensor Class: Discrete
    Event Data 1
    [7:6] - 00b = unspecified Event Data 2
            01b = previous state and/or severity in Event Data 2
            10b = OEM code in Event Data 2
            11b = sensor-specific event extension code in Event Data 2
    [5:4] - 00b = unspecified Event Data 3
            01b = reserved
            10b = OEM code in Event Data 3
            11b = sensor-specific event extension code in Event Data 3
    [3:0] - Offset from Event/Reading Code for discrete event state
    Event Data 2
    [7:4] - Optional offset from ‘Severity’ Event/Reading Code. (0Fh if unspecified.)
    [3:0] - Optional offset from Event/Reading Type Code for previous discrete event state. (0Fh if
            unspecified.)
    Event Data 3 – Optional OEM code. FFh or not present if unspecified.
Sensor Class: OEM
    Event Data 1
    [7:6] - 00b = unspecified in Event Data 2
            01b = previous state and/or severity in Event Data 2
            11b = reserved
    [3:0] - Offset from Event/Reading Type Code
    Event Data 2
    [7:4] - Optional OEM code bits or offset from ‘Severity’ Event/Reading Type Code. (0Fh if
            unspecified.)
    [3:0] - Optional OEM code or offset from Event/Reading Type Code for previous event state. (0Fh if
            unspecified.)
    Event Data 3 – Optional OEM code. FFh or not present if unspecified.
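The bit fields of Event Data 1 described in Table 2 can be unpacked with a few shifts and masks. The sketch below covers only the threshold and discrete sensor classes; the function name `decode_event_data1` and the dictionary layout are hypothetical choices made for this illustration.

```python
# Meaning of Event Data 1 bits [7:6] (contents of Event Data 2), per sensor class.
ED2_CONTENTS = {
    "threshold": ("unspecified", "trigger reading", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "previous state and/or severity", "OEM code",
                 "sensor-specific event extension code"),
}

# Meaning of Event Data 1 bits [5:4] (contents of Event Data 3), per sensor class.
ED3_CONTENTS = {
    "threshold": ("unspecified", "trigger threshold value", "OEM code",
                  "sensor-specific event extension code"),
    "discrete": ("unspecified", "reserved", "OEM code",
                 "sensor-specific event extension code"),
}


def decode_event_data1(ed1, sensor_class="threshold"):
    """Return what Event Data 2 and 3 hold, plus the event offset."""
    if sensor_class not in ED2_CONTENTS:
        raise ValueError("sensor_class must be 'threshold' or 'discrete'")
    return {
        "event_data2": ED2_CONTENTS[sensor_class][(ed1 >> 6) & 0x03],
        "event_data3": ED3_CONTENTS[sensor_class][(ed1 >> 4) & 0x03],
        "offset": ed1 & 0x0F,  # bits [3:0]
    }


# 52h = 01 01 0010b: reading in ED2, threshold in ED3, offset 2.
print(decode_event_data1(0x52))
```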
Table 3: OEM SEL Record (Type C0h-DFh)
Bytes 1-2: Record ID (RID)
    ID used for SEL Record access.
Byte 3: Record Type (RT)
    [7:0] - Record Type
        C0h-DFh = OEM timestamped, bytes 8-16 OEM defined
Bytes 4-7: Timestamp (TS)
    Time when event was logged. LS byte first.
    Example: TS:[29][76][68][4C] = 4C687629h = 1281914409 = Sun, 15 Aug 2010 23:20:09 UTC
    Note: There are various websites that will convert the raw number to a date/time.
Bytes 8-10: Manufacturer ID
    LS Byte first. The manufacturer ID is a 20-bit value that is derived from the IANA
    ‘Private Enterprise’ ID.
    Most significant four bits = reserved (0000b).
    000000h = unspecified. 0FFFFFh = reserved.
    This value is binary encoded.
    For example, the ID for the IPMI forum is 7154 decimal, which is 1BF2h, which would
    be stored in this record as F2h, 1Bh, 00h for bytes 8 through 10, respectively.
Bytes 11-16: OEM Defined
    OEM Defined. This is defined according to the manufacturer identified by the
    Manufacturer ID field.
Table 4: OEM SEL Record (Type E0h-FFh)
Bytes 1-2: Record ID (RID)
    ID used for SEL Record access.
Byte 3: Record Type (RT)
    [7:0] - Record Type
        E0h-FFh = OEM system event record
Bytes 4-16: OEM
    OEM Defined. This is defined by the system integrator.
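As a worked example of the record type ranges and the Manufacturer ID encoding, the following Python sketch classifies a Record Type byte and reassembles the 20-bit Manufacturer ID from bytes 8 through 10. The function names are hypothetical, and record types outside the three ranges listed in this guide are simply reported as not described here.

```python
def classify_record_type(record_type):
    """Map the Record Type byte (byte 3) to the ranges listed in this guide."""
    if record_type == 0x02:
        return "system event record"
    if 0xC0 <= record_type <= 0xDF:
        return "OEM timestamped"
    if 0xE0 <= record_type <= 0xFF:
        return "OEM non-timestamped"
    return "not described in this guide"


def decode_manufacturer_id(byte8, byte9, byte10):
    """Rebuild the 20-bit Manufacturer ID, stored LS byte first."""
    value = byte8 | (byte9 << 8) | (byte10 << 16)
    return value & 0x0FFFFF  # the most significant four bits are reserved


# IPMI forum example from Table 3: F2h, 1Bh, 00h -> 1BF2h -> 7154 decimal.
print(decode_manufacturer_id(0xF2, 0x1B, 0x00))  # 7154
print(classify_record_type(0xF0))                # OEM non-timestamped
```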
Sensor Cross Reference List
Sensor Number | Sensor Name | Details Section | Next Steps
01h | Power Unit Status (Pwr Unit Status) | Power Unit Status Sensor | Table 16: Power Unit Status Sensor – Sensor Specific Offsets – Next Steps
02h | Power Unit Redundancy (Pwr Unit Redund) | Power Unit Redundancy Sensor | Table 18: Power Unit Redundancy Sensor – Event Trigger Offset – Next Steps
The following table can be used to find the details of sensors owned by BIOS SMI.
Table 7: BIOS SMI owned Sensors
3.4 Hot Swap Controller Firmware owned Sensors (GID = 00C0h/00C2h)
The following table can be used to find the details of sensors owned by the Hot Swap Controller (HSC) firmware. The HSC firmware resides on
a Hot Swap Backplane (HSBP). There can be up to two HSBPs in a system. Each HSBP has its own GID.
00C0h = HSC Firmware – HSBP A
00C2h = HSC Firmware – HSBP B
Table 8: Hot Swap Controller Firmware owned Sensors
Sensor Number | Sensor Name | Details Section | Next Steps
01h | Backplane Temperature | HSC Backplane Temperature Sensor | Table 82: HSC Backplane Temperature Sensor – Event Trigger Offset – Next Steps
02h | Drive Slot 0 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
03h | Drive Slot 1 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
04h | Drive Slot 2 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
05h | Drive Slot 3 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
06h | Drive Slot 4 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
07h | Drive Slot 5 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
6 Slot HSBP
08h | Drive Slot 0 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
09h | Drive Slot 1 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Ah | Drive Slot 2 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Bh | Drive Slot 3 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Ch | Drive Slot 4 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Dh | Drive Slot 5 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
8 Slot HSBP
08h | Drive Slot 6 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
09h | Drive Slot 7 Status | HSC Drive Slot Status Sensor | HSC Drive Slot Status Sensor – Next Steps
0Ah | Drive Slot 0 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Bh | Drive Slot 1 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Ch | Drive Slot 2 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Dh | Drive Slot 3 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Eh | Drive Slot 4 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
0Fh | Drive Slot 5 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
10h | Drive Slot 6 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
11h | Drive Slot 7 Presence | HSC Drive Presence Sensor | HSC Drive Presence Sensor – Next Steps
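Because the same HSC sensor number can mean different things on a 6 slot and an 8 slot backplane (for example, 08h), a lookup needs to know the backplane type. The sketch below is one way to express Table 8 in Python; the names `HSBP_BY_GID` and `hsc_sensor_name` are hypothetical, and the arithmetic relies on the regular numbering pattern observed in the table (status sensors start at 02h and presence sensors follow directly after them) rather than on any documented rule.

```python
# Generator IDs of the two possible Hot Swap Backplanes.
HSBP_BY_GID = {0x00C0: "HSBP A", 0x00C2: "HSBP B"}


def hsc_sensor_name(sensor_number, slots):
    """Return the Table 8 sensor name for a 6 or 8 slot HSBP, or None."""
    if slots not in (6, 8):
        raise ValueError("Table 8 only lists 6 slot and 8 slot backplanes")
    if sensor_number == 0x01:
        return "Backplane Temperature"
    first_status = 0x02
    first_presence = first_status + slots  # 08h for 6 slots, 0Ah for 8 slots
    if first_status <= sensor_number < first_presence:
        return f"Drive Slot {sensor_number - first_status} Status"
    if first_presence <= sensor_number < first_presence + slots:
        return f"Drive Slot {sensor_number - first_presence} Presence"
    return None  # not listed in Table 8


print(HSBP_BY_GID[0x00C2], hsc_sensor_name(0x08, 6))  # HSBP B Drive Slot 0 Presence
print(HSBP_BY_GID[0x00C0], hsc_sensor_name(0x08, 8))  # HSBP A Drive Slot 6 Status
```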
The following table can be used to find the details of records that are owned by the Microsoft* Operating System (OS).
Table 10: Microsoft* OS owned Events
3.7 Linux* Kernel Panic Events (GID = 0021h)
The following table can be used to find the details of records that can be generated when there is a Linux* Kernel panic.
Table 11: Linux* Kernel Panic Events
Sensor Name | Record Type | Sensor Type | Details Section | Next Steps
Linux* Kernel Panic | 02h | 20h = OS Stop/Shutdown | Table 96: Linux* Kernel Panic Event Record Characteristics | Not applicable
Linux* Kernel Panic | F0h | Not applicable | Table 97: Linux* Kernel Panic String Extended Record Characteristics | Not applicable
4. Power Subsystems
The BMC monitors the power subsystem including power supplies, select onboard voltages, and related sensors.
4.1 Voltage Sensors
The BMC monitors the main voltage sources in the system, including the baseboard, memory, and processors, using IPMI compliant
analog/threshold sensors.
Note: A voltage error could be caused by the device supplying the voltage or by the device using the voltage. For each sensor, the
device supplying the voltage and the device using it are noted.
Table 12: Voltage Sensors Typical Characteristics
Byte 11: Sensor Type
    02h = Voltage
Byte 12: Sensor Number
    See Table 14
Byte 13: Event Direction and Event Type
    [7] Event direction
        0b = Assertion Event
        1b = Deassertion Event
    [6:0] Event Type = 01h (Threshold)
Byte 14: Event Data 1
    [7:6] – 01b = Trigger reading in Event Data 2
    [5:4] – 01b = Trigger threshold in Event Data 3
    [3:0] – Event Triggers as described in Table 13
Byte 15: Event Data 2
    Reading that triggered event
Byte 16: Event Data 3
    Threshold value that triggered event
The following table describes the severity of each of the event triggers for both assertion and for deassertion.
Table 13: Voltage Sensors Event Triggers – Description
Event Trigger (Hex) | Event Trigger (Description) | Assertion Severity | Deassert Severity | Description
00h | Lower non critical going low | Degraded | OK | The voltage has dropped below its lower non critical threshold.
02h | Lower critical going low | non-fatal | Degraded | The voltage has dropped below its lower critical threshold.
07h | Upper non critical going high | Degraded | OK | The voltage has gone over its upper non critical threshold.
09h | Upper critical going high | non-fatal | Degraded | The voltage has gone over its upper critical threshold.
Table 14: Voltage Sensors – Next Steps
Sensor 10h: BB +1.1V IOH
    This 1.1V line is supplied by the main board.
    This 1.1V line is used by the I/O hub (IOH).
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the main board.
Sensor 11h: BB +1.1V P1 Vccp
    This 1.1V line is supplied by the main board.
    This 1.1V line is used by processor 1.
    1. Ensure all cables are connected correctly.
    2. Cross test the processor if possible. If the issue remains with the socket, replace the main board; otherwise replace the processor.
Sensor 12h: BB +1.1V P2 Vccp
    This 1.1V line is supplied by the main board.
    This 1.1V line is used by processor 2.
    1. Ensure all cables are connected correctly.
    2. Cross test the processor if possible. If the issue remains with the socket, replace the main board; otherwise replace the processor.
Sensor 13h: BB +1.5V P1 DDR3
    This 1.5V line is supplied by the main board.
    This 1.5V line is used by the memory on processor 1.
    1. Ensure all cables are connected correctly.
    2. Check that the DIMMs are seated properly.
    3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
Sensor 14h: BB +1.5V P2 DDR3
    This 1.5V line is supplied by the main board.
    This 1.5V line is used by the memory on processor 2.
    1. Ensure all cables are connected correctly.
    2. Check that the DIMMs are seated properly.
    3. Cross test the DIMMs. If the issue remains with the DIMMs on this socket, replace the main board; otherwise replace the DIMM.
Sensor 15h: BB +1.8V AUX
    +1.8V is supplied by the main board.
    +1.8V is used by the onboard NIC and I/O hub.
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the main board.
Sensor 16h: BB +3.3V
    +3.3V is supplied by the power supplies.
    +3.3V is used by the PCIe and PCI-X slots.
    1. Ensure all cables are connected correctly.
    2. Reseat any PCI cards and try other slots.
    3. If the issue follows the card, swap it; otherwise, replace the main board.
    4. If the issue remains, replace the power supplies.
Sensor 17h: BB +3.3V STBY
    +3.3V Stby is supplied by the main board.
    +3.3V Stby is used by the BMC, on-board NIC, IOH, and ICH.
    1. Ensure all cables are connected correctly.
    2. If the issue remains, replace the board.
    3. If the issue remains, replace the power supplies.
Sensor 18h: BB +3.3V Vbat
    +3.3V Vbat is supplied by the CMOS battery when power is off and by the main board when power is on.
    +3.3V Vbat is used by the CMOS and related circuits.
    1. Replace the CMOS battery. Any battery of type CR2032 can be used.
    2. If the error remains (unlikely), replace the board.
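Tables 12 through 14 can be combined into a small lookup that turns the raw bytes of a voltage event into a readable summary. This is a sketch only: the function name `describe_voltage_event` and the result layout are hypothetical, and the sensor dictionary holds only the sensor numbers 10h through 18h listed above. Converting the raw reading in Event Data 2 to volts is not attempted, because the conversion factors are not given in this section.

```python
# Table 13: trigger offset -> (description, assertion severity, deassertion severity)
VOLTAGE_TRIGGERS = {
    0x00: ("Lower non critical going low", "Degraded", "OK"),
    0x02: ("Lower critical going low", "non-fatal", "Degraded"),
    0x07: ("Upper non critical going high", "Degraded", "OK"),
    0x09: ("Upper critical going high", "non-fatal", "Degraded"),
}

# Table 14: sensor number -> sensor name (only the sensors listed above)
VOLTAGE_SENSORS = {
    0x10: "BB +1.1V IOH", 0x11: "BB +1.1V P1 Vccp", 0x12: "BB +1.1V P2 Vccp",
    0x13: "BB +1.5V P1 DDR3", 0x14: "BB +1.5V P2 DDR3", 0x15: "BB +1.8V AUX",
    0x16: "BB +3.3V", 0x17: "BB +3.3V STBY", 0x18: "BB +3.3V Vbat",
}


def describe_voltage_event(sensor_number, dir_type, event_data1):
    """Summarize a voltage threshold event from bytes 12, 13, and 14."""
    if dir_type & 0x7F != 0x01:
        raise ValueError("event type is not 01h (threshold)")
    offset = event_data1 & 0x0F  # Event Data 1 bits [3:0]
    if offset not in VOLTAGE_TRIGGERS:
        raise ValueError("trigger offset is not listed in Table 13")
    trigger, on_assert, on_deassert = VOLTAGE_TRIGGERS[offset]
    deassertion = bool(dir_type & 0x80)  # byte 13 bit [7]
    return {
        "sensor": VOLTAGE_SENSORS.get(sensor_number, "unknown sensor"),
        "trigger": trigger,
        "direction": "deassertion" if deassertion else "assertion",
        "severity": on_deassert if deassertion else on_assert,
    }


# Sensor 10h, assertion of a threshold event, Event Data 1 = 52h (offset 02h).
print(describe_voltage_event(0x10, 0x01, 0x52))
```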