Copyright 2014, 2017 Hewlett Packard Enterprise Development LP
Notices
The information contained herein is subject to change without notice. The only warranties for Hewlett
Packard Enterprise products and services are set forth in the express warranty statements accompanying
such products and services. Nothing herein should be construed as constituting an additional warranty.
Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained
herein.
Confidential computer software. Valid license from Hewlett Packard Enterprise required for possession,
use, or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer
Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government
under vendor's standard commercial license.
Links to third-party websites take you outside the Hewlett Packard Enterprise website. Hewlett Packard
Enterprise has no control over and is not responsible for information outside the Hewlett Packard
Enterprise website.
Acknowledgments
Intel®, Itanium®, Pentium®, Xeon®, Intel Inside®, and the Intel Inside logo are trademarks of Intel
Corporation in the U.S. and other countries.
Microsoft® and Windows® are either registered trademarks or trademarks of Microsoft Corporation in the
United States and/or other countries.
Adobe® and Acrobat® are trademarks of Adobe Systems Incorporated.
Java® and Oracle® are registered trademarks of Oracle and/or its affiliates.
UNIX® is a registered trademark of The Open Group.
Revision History

HPE Part Number   Edition   Publication Date   Changes

794235–001        First     December 2014
794235–002        Second    March 2015
794235–003        Third     September 2015
794235–004        Fourth    January 2016

Changes in the Third and Fourth editions:
•Added BL920s Gen9 blade support
•Added SLES 11 SP4 and SLES 12 OS support
•Added RHEL 6.6, RHEL 6.7, and RHEL 7.1 OS support
•Added Windows 2012 R2 OS support (Gen8)
•Added ESXi OS support (Gen8)
•Moved firmware update information from the installation chapter to a dedicated chapter. Refer to the firmware matrix and release notes for correct information.
•Removed detailed SLES boot/shutdown information and added references to the Linux and Windows white papers.
•Minor text changes and clarifications throughout

794235–005        Fifth     July 2016

Changes in the Fifth edition:
•Added details for safely powering off an enclosure
•Added BL920s Gen9+ blade support
•Added FlexFabric 20 Gb 2P 650FLB and 650M adapter support
•Added a note about scrolling the Insight Display
•Added instructions to save EFI variables to disk
•Added sections on troubleshooting the OA battery
•Updated illustrations for new HPE standards
•Updated Insight Display screens
•Added a troubleshooting scenario where PXE fails to find the boot file
•Updated references to the new XFM2 crossbar modules

794235–006        Sixth     September 2016
794235–007        Seventh   November 2016
794235–008        Eighth    April 2017

Changes in the Sixth through Eighth editions:
•Updated access to OS white papers for firmware updates
•Updated Insight Display screenshots
•Included component IDs for both XFM and XFM2 modules
•Added notes that both XFM and XFM2 modules are referred to as XFM in this document and that module types must not be mixed in the same system
•Updated the OS support list
•Added links to current OS and spare parts information
•Added vSphere 6.0 U3 and RHEL 6.9 to the supported OS list
•Added the XFM2 firmware version to the FRU replacement firmware update procedures

794235-009        Ninth     November 2018      Updated Health LED in LEDs and components
HPE Integrity Superdome X overview
HPE Integrity Superdome X is a blade-based, high-end server platform supporting the x86 processor
family which incorporates a modular design and uses the sx3000 crossbar fabric to interconnect
resources. The system also includes remote system management functionality through the HPE Onboard
Administrator (OA), which helps monitor and manage complex resources.
Integrity Superdome X supports the SuSE Linux Enterprise Server, Red Hat Enterprise Linux, and
Microsoft Windows OSs, as well as VMware ESXi. For the latest list of supported OSs, see the HPE Integrity
Superdome X Operating System Reference at http://www.hpe.com/info/enterprise/docs
(Servers > Integrity Servers > Integrity Superdome X) or the Firmware Matrix for HPE Integrity
Superdome X servers at http://www.hpe.com/info/superdomeX-firmware-matrix.
Complex components
Integrity Superdome X consists of a single compute enclosure containing one to eight BL920s Gen8 or
Gen9 blades. It also includes interconnect modules, manageability modules, fans, power supplies, and an
integrated LCD Insight Display. The Insight Display can be used for basic enclosure maintenance and
displays the overall enclosure health. The compute enclosure supports four XFMs that provide the
crossbar fabric which carries data between blades.
NOTE: HPE Integrity Superdome X systems may contain XFM or XFM2 crossbar modules. Unless
specifically stated otherwise, this document refers to all crossbar modules as XFMs, and the information
generally applies to either XFM or XFM2 modules.
More information
Integrity Superdome X QuickSpecs
Power subsystem
The Integrity Superdome X compute enclosure supports two power input modules, using either single
phase or 3-phase power cords. Connecting two AC sources to each power input module provides 2N
redundancy for AC input and DC output of the power supplies.
There are 12 power supplies per Integrity Superdome X compute enclosure. Six power supplies are
installed in the upper section of the enclosure, and six power supplies are installed in the lower section of
the enclosure.
More information
Integrity Superdome X QuickSpecs
Powering off the compute enclosure
IMPORTANT: To power off the enclosure, disconnect the power cables from the lower power
supplies first, and then disconnect the power cables from the upper power supplies.
To service any internal compute enclosure component, complete the following steps in order:
Procedure
1. Power off the partition.
2. Power off all XFMs.
3. Disconnect the power cables from the lower power supplies.
4. Disconnect the power cables from the upper power supplies.
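The partition and XFM power-off steps can be performed from the OA CLI before removing power. The following is a minimal sketch only: the poweroff partition and parstatus -P commands are described in the Partitioning chapter, while the XFM power-off syntax shown here is an assumption that should be verified in the HPE Integrity Superdome X and Superdome 2 Onboard Administrator Command Line Interface User Guide.

sd-oa1> poweroff partition 1     (repeat for each defined nPartition)
sd-oa1> parstatus -P             (confirm that all nPartitions report Inactive)
sd-oa1> poweroff xfm all         (assumed syntax for powering off all XFMs; verify before use)

Once the partitions and XFMs are powered off, disconnect the lower power supply cables and then the upper power supply cables as described in the steps above.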
Manageability subsystem
The Integrity Superdome X is managed by two OAs that monitor both individual components and complex
health. This information can be accessed in the following ways:
•A GUI using a remote terminal
•A CLI using a remote or local terminal
NOTE: Only one OA is required for operation. The second OA provides redundancy and automatic
failover capabilities.
Two GPSMs in the Integrity Superdome X enclosure manage CAMNET distribution to all server blades
and XFMs in the complex and provide the redundant global clock source for the complex. Fans and
power supplies in the upper section of the enclosure are monitored and controlled by the OA through the
GPSMs.
More information
Integrity Superdome X QuickSpecs
Server blades
Each BL920s server blade contains two x86 processors and up to 48 DIMMs.
Server blades and partitions
Integrity Superdome X supports multiple nPartitions of 2, 4, 6, 8, 12, or 16 sockets (1, 2, 3, 4, 6, or 8
blades). Each nPartition must include blades of the same type but the system can include nPartitions with
different blade types.
More information
Integrity Superdome X QuickSpecs
I/O subsystem
Integrity Superdome X provides I/O through mezzanine cards and FlexLOMs on individual server blades.
Each BL920s blade has two FLB slots and three Mezzanine slots.
Not all types of cards are supported on Gen8 and Gen9 blades. For a complete list of supported I/O cards
and firmware requirements, see the Firmware Matrix for HPE Integrity Superdome X servers at http://www.hpe.com/info/superdomeX-firmware-matrix.
Fibre Channel and LAN connectivity are supported by the interconnect modules in the rear of the compute
enclosure. For more information, see Interconnect bay numbering.
More information
•Interconnect bay numbering
•Integrity Superdome X QuickSpecs
•Firmware Matrix for HPE Integrity Superdome X servers
•Connecting a PC to the OA service port
Compute enclosure overview
Compute enclosure front components
NOTE: Images might not represent supported configurations.
Item   Description
1      Power supply bay 7
2      Power supply bay 8
3      Power supply bay 9
4      Power supply bay 10
5      Power supply bay 11
6      Power supply bay 12
7      DVD module
8      Air intake slot (Do not block)
9      Power supply bay 6
10     Power supply bay 5
11     Insight Display
12     Power supply bay 4
13     Power supply bay 3
14     Power supply bay 2
15     Power supply bay 1
16     Blade slots
17     Air intake slot (Do not block)
Power supply bay numbering
Server blade slot numbering
Insight Display components
Item   Description              Function
1      Insight Display screen   Displays the Main Menu, error messages, and instructions
2      Left arrow button        Moves the menu or navigation bar selection left one position
3      Right arrow button       Moves the menu or navigation bar selection right one position
4      OK button                Accepts the highlighted selection and navigates to the selected menu
5      Down arrow button        Moves the menu selection down one position
6      Up arrow button          Moves the menu selection up one position
Compute enclosure rear components
Item   Description
1      AC power connectors (upper)
2      Fan bay 1
3      Fan bay 6
4      Fan bay 2
5      Fan bay 7
6      Fan bay 3
7      Fan bay 8
8      Fan bay 4
9      Fan bay 9
10     Fan bay 5
11     Fan bay 10
12     Power supply exhaust vent (Do not block)
13     XFM bay 1
14     XFM bay 2
15     XFM bay 3
16     XFM bay 4
17     GPSM bay 2
18     Interconnect bay 2
19     Interconnect bay 4
20     Interconnect bay 6
21     Interconnect bay 8
22     OA bay 2
23     Power supply exhaust vent (Do not block)
24     AC power connectors (lower)
25     Fan bay 15
26     Fan bay 14
27     Fan bay 13
28     Fan bay 12
29     Fan bay 11
30     OA bay 1
31     Interconnect bay 7
32     Interconnect bay 5
33     Interconnect bay 3
34     Interconnect bay 1
35     GPSM bay 1
Fan bay numbering
Interconnect bay numbering
Each Integrity Superdome X enclosure requires interconnect modules to provide network access for data
transfer. Interconnect modules reside in bays located in the rear of the enclosure. Review blade slot
numbering to determine which external network connections on the interconnect modules are active.
To support server blade LAN and Fibre Channel I/O connections, an appropriate type of interconnect
module is installed according to bay location.

Server blade port     Compute enclosure interconnect bay
FlexLOM 1 port 1      1
FlexLOM 1 port 2      2
FlexLOM 2 port 1      1
FlexLOM 2 port 2      2
Mezzanine 1 port 1    3
Mezzanine 1 port 2    4
Mezzanine 1 port 3    3
Mezzanine 1 port 4    4
Mezzanine 2 port 1    5
Mezzanine 2 port 2    6
Mezzanine 2 port 3    7
Mezzanine 2 port 4    8
Mezzanine 3 port 1    7
Mezzanine 3 port 2    8
Mezzanine 3 port 3    5
Mezzanine 3 port 4    6
NOTE: For information on the location of LEDs and ports on individual interconnect modules, see the
documentation that ships with the interconnect module.
More information
Integrity Superdome X QuickSpecs
Server blade overview
BL920s Gen8
  Processors: 2
  DIMM slots: 48
  Supported DIMM sizes: 16 GB and 32 GB
  PCIe I/O Mezzanine card capacity: 3
  PCIe I/O FlexLOM card capacity: 2

BL920s Gen9
  Processors: 2
  DIMM slots: 48
  Supported DIMM sizes: 16 GB, 32 GB, and 64 GB
  PCIe I/O Mezzanine card capacity: 3
  PCIe I/O FlexLOM card capacity: 2
Server blade components
Item   Description
1      sx3000 crossbar fabric ASIC (referred to as XNC by the Health Repository and in event logs)
2      CPU 1
3      Mezzanine bracket
4      Mezzanine connector 1 Type A
5      Mezzanine connector 2 Type A/B
6      FlexLOM slot 2
7      CPU 0
8      Mezzanine connector 3 Type A/B
9      FlexLOM slot 1
10     DDR3 DIMM slots (48) on BL920s Gen8; DDR4 or LR DIMM slots (48) on BL920s Gen9
11     SUV board
SUV cable and ports
The SUV port on the front of the server blade is used with an SUV cable to connect the blade to external
devices (serial terminal or monitor) or USB devices. The SUV port is located behind a door that stays
closed when an SUV cable is not installed.
CAUTION: The SUV cable is not designed to be used as a permanent connection; therefore be
careful when walking near the server blade. Hitting or bumping the cable might cause the port on
the server blade to break and damage the blade.
IMPORTANT: The SUV port does not provide console access and the serial port is unused.
Item   Description
1      Server blade connector
2      Serial
3      USB ports (2)
4      Video
More information
Integrity Superdome X QuickSpecs
System specifications
Dimensions and weights
Component dimensions
Table 1: Component dimensions

Component           Width               Depth                 Height
Compute enclosure   44.7 cm (17.6 in)   82.8 cm (32.6 in)     79.8 cm (31.4 in)
Server blade        5.13 cm (2.02 in)   52.25 cm (20.60 in)   62.18 cm (24.48 in)

Component weights

Table 2: Compute enclosure weights

Component                   Weight                Max. quantity per enclosure
Compute enclosure chassis   64.9 kg (143.0 lb)    1
I/O chassis                 22.1 kg (48.7 lb)     1
Midplane Brick              18.8 kg (41.5 lb)     1
OA tray                     3.6 kg (8.0 lb)       1
Active Cool Fan             0.9 kg (2.7 lb)       15
Power supply module         2.3 kg (5.0 lb)       12
Enclosure DVD module        2.1 kg (4.7 lb)       1
OA module                   0.8 kg (1.8 lb)       2
GPSM                        1.2 kg (2.6 lb)       2
XFM                         3.3 kg (7.3 lb)       4
I/O interconnect module     1.3 kg (2.9 lb)       8
Server blade                12-16 kg (26-35 lb)   8

More information
Generic Site Preparation Guide

Rack specifications

Table 3: Rack specifications

HPE 642 1075 mm Intelligent Rack
  Total cabinet area with packing materials (H x D x W): 216.80 x 129.20 x 90 cm (85.35 x 50.87 x 35.43 in)
  U height: 42U
  Width: 597.8 mm (23.54 in)
  Depth: 1,085.63 mm (42.74 in)
  Dynamic load (gross): 1,134 kg (2,500 lb)
  Static load: 1,360.8 kg (3,000 lb)

HPE 642 1200 mm Shock Intelligent Rack
  Total cabinet area with packing materials (H x D x W): 218.00 x 147.00 x 90 cm (85.82 x 57.87 x 35.43 in)
  U height: 42U
  Width: 597.8 mm (23.54 in)
  Depth: 1,300.2 mm (51.19 in)
  Dynamic load (gross): 1,460.11 kg (3,219 lb)
  Static load: 1,360.78 kg (3,000 lb)

More information
Generic Site Preparation Guide

Internal and external site door requirements
Internal site doorways must obey the following height requirements:
•For the 642 1075 mm rack — no less than 200.19 cm (78.816 in)
•For the 642 1200 mm rack — no less than 200.66 cm (79.00 in)
To account for the lifted height of the pallet, external doorways must obey the following height
requirements:
•For the 642 1075 mm rack — no less than 216.80 cm (85.35 in)
•For the 642 1200 mm rack — no less than 215.00 cm (84.65 in)
More information
Generic Site Preparation Guide
Electrical specifications
Table 4: Enclosure power options
3-phase
  Source voltage (nominal): 200 VAC to 240 VAC line-to-line (phase-to-phase), 3-phase, 50/60 Hz
  Plug or connector type: NEMA L15-30P, 3-pole, 4-wire, 3 m (10 ft) power cord
  Circuit type: 30 A 3-phase
  Power receptacle required: L15-30R, 3-pole, 4-wire
  Number of power cords required (per enclosure): 4

3-phase
  Source voltage (nominal): 220 VAC to 240 VAC line-to-neutral, 3-phase, 50/60 Hz
  Plug or connector type: IEC 309, 4-pole, 5-wire, Red, 3 m (10 ft) power cord
  Circuit type: 16 A
  Power receptacle required: IEC 309, 4-pole, 5-wire, red
  Number of power cords required (per enclosure): 4

Single-phase
  Source voltage (nominal): 200 VAC to 240 VAC, 50/60 Hz
  Plug or connector type: IEC 320 C19-C20
  Circuit type: 16/20 A Single-phase
  Power receptacle required: IEC 320 C19
  Number of power cords required (per enclosure): 12

Table 5: Single-phase power cords

Part number   Description               Where used
8120-6895     Stripped end, 240 V       International - other
8120-6897     Male IEC309, 240 V        International
8121-0070     Male GB-1002, 240 V       China
8120-6903     Male NEMA L6-20, 240 V    North America/Japan
Table 6: Enclosure single-phase HPE 2400 W power supply specifications
Specification                                       Value
Power cord                                          IEC-320 C19-C20
Output                                              2450 W per power supply
Input requirements
  Rated input voltage                               200–240 VAC
  Rated input frequency                             50-60 Hz
  Rated input current per power supply (maximum)    13.8 A at 200 VAC; 13.3 A at 208 VAC; 12.6 A at 220 VAC
  Maximum inrush current                            100 A for 10 ms
  Ground leakage current                            3.5 mA
  Power factor correction                           0.98
Table 7: Enclosure 3-phase 2400 W power supply specifications (North America/
Japan)

Number of power cords required (per enclosure leaving the rack): 2

Single-phase 63 A
  Source voltage (nominal): 200–240 VAC, 50/60 Hz
  Plug or connector type: IEC 309 63 A Single Phase Blue, 3.6 m (11.8 ft) power cord
  Power receptacle required: IEC 309 63 A Single Phase, Blue
  Number of power cords required (per enclosure leaving the rack): 4

Single-phase 30 A
  Source voltage (nominal): 200–240 VAC, 50/60 Hz
  Plug or connector type: NEMA L6-30P Single Phase, 3.6 m (11.8 ft) power cord
  Power receptacle required: NEMA L6-30R Single Phase
  Number of power cords required (per enclosure leaving the rack): 6
More information
Generic Site Preparation Guide
Environmental specifications
Temperature and humidity specifications
The following table contains the allowed and recommended temperature and humidity limits for both
operating and nonoperating Integrity Superdome X systems.
Specification                        Value
Temperature range
  Allowable Operating Range (1)      +5° C to +40° C (41° F to 104° F)
  Recommended Operating Range (1)    +18° C to +27° C (64° F to 81° F)
  Nonoperating (powered off)         +5° C to +45° C (41° F to 113° F)
  Nonoperating (storage)             -40° C to +80° C (-40° F to 176° F)
Humidity Range (noncondensing)
  Allowable Operating Range (1)      -12° C DP and 8% RH to +24° C DP and 85% RH
  Recommended Operating Range (1)    +5.5° C DP to +15° C DP and 65% RH
  Nonoperating (powered off)         8% RH to 90% RH and 29° C DP
  Nonoperating (storage)             8% RH to 90% RH and 32° C DP

(1) The Recommended Operating Range is recommended for continuous operation. Operating within the Allowable
Operating Range is supported but might result in a decrease in system performance.
More information
Generic Site Preparation Guide
Cooling requirements
Integrity Superdome X is a rack-mounted system that cools by drawing air in the front and exhausting it
out the rear. General ASHRAE best practices must be followed when installing the system in a data
center.
•Hot/cold aisle layout
•Appropriate blanking panels in any unused space in the rack.
•No gaps exist between adjacent racks, which ensures minimal air recirculation.
•An adequate hot-air return path to the computer room air conditioners (CRAC) or computer room air
handlers (CRAH), which minimizes the flow of hot air over any rack.
Integrity Superdome X utilizes variable speed fans to realize the most efficient use of air. The volume of
air required varies with the temperature of the air supplied to the inlet.
IMPORTANT: The optimal equipment orientation is a layout parallel to the air flow supply and
return. Supply air flows down cold aisles that are parallel to equipment rows, and return air flows
back to the CRAC units along parallel paths. Perpendicular air flow causes too much room mixing,
places higher electrical loads on the room, and can lead to unexpected equipment problems.
More information
Generic Site Preparation Guide
Air quality specifications
Chemical contaminant levels in customer environments for Hewlett Packard Enterprise hardware
products must not exceed G1 (mild) levels of Group A chemicals at any time. These contaminant levels
are described in the current version of ISA–71.04 Environmental Conditions for Process Measurement and Control Systems: Airborne Contaminants.
More information
•Generic Site Preparation Guide
•ISA–71.04 Environmental Conditions for Process Measurement and Control Systems: Airborne
Contaminants
Acoustic noise specifications
The acoustic noise specification is 8.6 bel (86 dB) (sound power level).
IMPORTANT: Hewlett Packard Enterprise recommends that anyone in the immediate vicinity of the
product for extended periods of time wear hearing protection or use other means to reduce noise
exposure.
This level of noise is appropriate for dedicated computer room environments, not office environments.
Understand the acoustic noise specifications relative to operator positions within the computer room when
adding Integrity Superdome X systems to computer rooms with existing noise sources.
More information
Generic Site Preparation Guide
Sample site inspection checklist for site preparation
See Customer and Hewlett Packard Enterprise Information and Site inspection checklist. You can
use these tables to measure your progress.
Table 11: Customer and Hewlett Packard Enterprise Information
Customer Information
  Name:                            Phone number:
  Street address:                  City or Town:
  State or province:               Country:
  Zip or postal code:
  Primary customer contact:        Phone number:
  Secondary customer contact:      Phone number:
  Traffic coordinator:             Phone number:

Hewlett Packard Enterprise information
  Sales representative:            Order number:
  Representative making survey:    Date:
  Scheduled delivery date:
Table 12: Site inspection checklist
Check either Yes or No. If No, include comment or date.
Computer Room
Number   Area or condition   Yes   No   Comment or Date
1.Do you have a completed floor plan?
2.Is adequate space available for maintenance needs?
Front 91.4 cm (36 inches) minimum and rear 91.4 cm
(36 inches) minimum are recommended clearances.
3.Is access to the site or computer room restricted?
4.Is the computer room structurally complete? Expected
date of completion?
5.Is a raised floor installed and in good condition?
What is the floor to ceiling height? [228 cm (7.5 ft)
minimum]
6.Is the raised floor adequate for equipment loading?
7.Are channels or cutouts available for cable routing?
8.Is a network line available?
9.Is a telephone line available?
10.Are customer-supplied peripheral cables and LAN cables
available and of the proper type?
11.Are floor tiles in good condition and properly braced?
12.Is floor tile underside shiny or painted?
If painted, judge the need for particulate test.
Power and Lighting
13.Are lighting levels adequate for maintenance?
14.Are AC outlets available for servicing needs (for example,
laptop usage)?
15.Does the input voltage correspond to equipment
specifications?
15a.Is dual source power used? If so, identify types and
evaluate grounding.
16.Does the input frequency correspond to equipment
specifications?
17.Are lightning arrestors installed inside the building?
18.Is power conditioning equipment installed?
19.Is a dedicated branch circuit available for equipment?
20.Is the dedicated branch circuit less than 22.86 m (75 ft)?
21.Are the input circuit breakers adequate for equipment
loads?
Safety
22.Is an emergency power shutoff switch available?
23.Is a telephone available for emergency purposes?
24.Does the computer room have a fire protection system?
25.Does the computer room have anti-static flooring
installed?
26.Do any equipment servicing hazards exist (loose ground
wires, poor lighting, and so on)?
Cooling
27.Can cooling be maintained between 5° C (41° F) and
40° C (104° F) up to 1,525 m (5,000 ft)? Derate 1° C/
305 m (1.8° F/1,000 ft) above 1,525 m (5,000 ft) and up to
3,048 m (10,000 ft).
28.Can temperature changes be held to 5° C (9° F) per hour
with tape media? Can temperature changes be held to
20° C (36° F) per hour without tape media?
The following are examples of different types of
temperature changes.
•Unidirectional changes
— Storage operating temperature changes in excess of
20° C (36° F) are not within tolerance. Allow one hour
per 20° C (36° F) to acclimate.
•Multidirectional spurious changes
— Operating temperatures that increase 10° C (18° F)
and then decrease 10° C (18° F). This temperature
change is within tolerance as a 20° C (36° F) change
per hour.
•Repetitive changes
— Every 15 minutes, there is a repetitive, consistent
5° C (9° F) up and down change. This repetitive
temperature change is a 40° C (72° F) change per hour
and not within tolerance.
Also note that rapid changes to temperature over a short
period are more damaging than gradual changes over
time.
29.Can humidity level be maintained at 40% to 55% at 35° C
(95° F) noncondensing?
30.Are air-conditioning filters installed and clean?
Storage
31.Are cabinets available for tape and disc media?
32.Is shelving available for documentation?
Training
33.Are personnel enrolled in the System Administrator
Course?
34.Is on-site training required?
More information
Generic Site Preparation Guide
Updating firmware
Hewlett Packard Enterprise recommends that all firmware on all devices in your system be updated to the
latest version after hardware installation is complete. Hewlett Packard Enterprise also encourages you to
check back often for any updates that might have been posted.
There are two methods for updating the complex firmware: using SUM or manually.
Prerequisites
Before updating firmware, Hewlett Packard Enterprise strongly recommends implementing these security
best practices:
•Isolate the management network by keeping it separate from the production network and not putting it
on the open internet without additional access authentication.
•Patch and maintain LDAP and web servers.
•Run latest virus and malware scanners in your network environment.
Installing the latest complex firmware using SUM
The SUM utility enables you to deploy firmware components from either an easy-to-use interface or a
command line. It has an integrated hardware discovery engine that discovers the installed hardware and
the current versions of firmware in use on target servers. SUM contains logic to install updates in the
correct order and ensure that all dependencies are met before deployment of a firmware update. It also
contains logic to prevent version-based dependencies from destroying an installation and ensures that
updates are handled in a manner that reduces any downtime required for the update process. SUM does
not require an agent for remote installations.
SUM is included in the downloadable firmware bundles.
For more information about SUM, see the Smart Update Manager User Guide (http://www.hpe.com/
info/sum-docs).
NOTE: You can also update firmware manually. There are different firmware bundles for each method.
See the detailed instructions provided in the release notes for the firmware bundle for more information
about manually updating firmware. Also see Manually updating the complex firmware on page 34.
Manually updating the complex firmware
To update the complex firmware manually, you will:
Procedure
1. Download the firmware bundle.
2. Update the complex and nPartition firmware.
3. Update I/O firmware and SMH and WBEM providers.
4. Be sure to use only the recommended I/O firmware to avoid incompatibility with other system
firmware.
5. Check for driver and firmware updates for other devices.
To use SUM to update the complex firmware, see Installing the latest complex firmware using SUM on
page 34.
Download firmware bundle
Hewlett Packard Enterprise recommends running only approved firmware versions. For the latest
approved firmware versions, see the Firmware Matrix for HPE Integrity Superdome X servers at http://www.hpe.com/info/superdomeX-firmware-matrix. Follow the instructions provided in the bundle
Release Notes.
For special OS requirements, see the Superdome X firmware bundle Release Notes and these OS white
papers:
•Running Linux on HPE Integrity Superdome X white paper at http://www.hpe.com/support/
superdomeXlinux-whitepaper
•Running Microsoft Windows Server on HPE Integrity Superdome X white paper at http://
www.hpe.com/support/superdomeXwindows-whitepaper
•Running VMware vSphere on HPE Integrity Superdome X white paper at http://www.hpe.com/
support/superdomeXvmware-whitepaper
Update the complex firmware
To manually update the complex firmware:
Procedure
1. Refer to the Firmware Matrix for HPE Integrity Superdome X servers document at http://
www.hpe.com/info/superdomeX-firmware-matrix.
2. Select the complex firmware version for your OS to download and extract the latest HPE Integrity
Superdome X firmware bundle. Follow the instructions provided in the bundle Release Notes.
3. Copy the bundle to a media accessible from the OA.
4. Connect a PC to the OA over Telnet or SSH and log in to the CLI. For more information, see Connecting a
PC to the OA service port.
5. At the CLI prompt, use the connect blade <blade#> command to connect to each blade, and
then use the exit command to return to the OA prompt.
For example:
OA> connect blade 1
</>hpiLO-> exit
IMPORTANT: This will ensure that there is communication between OA and all blades. The
firmware update will fail if communication from OA to any blade is not working.
6. Use the Health Repository to discover currently indicted and deconfigured components.
Launch the Health Repository viewer with the SHOW HR command on the Monarch OA. List indicted
and deconfigured components with the SHOW INDICT and SHOW DECONFIG commands.
Address all indicted and deconfigured components before proceeding. Replace a deconfigured blade
or OA before starting the firmware update.
7. To start the firmware update, use the UPDATE FIRMWARE command; for example, update firmware
<uri> all, where <uri> is the path to the firmware bundle. The "all" option must be used to update
both complex and partition firmware. (A combined session sketch follows this procedure.)
The firmware update process can take up to one hour to complete. During this process, you might see
no progress for long periods of time, and the connection to the OA will be lost when the OA reboots
between updates.
NOTE: For more information about using the UPDATE FIRMWARE command, see the HPE Integrity
Superdome X and Superdome 2 Onboard Administrator Command Line Interface User Guide.
8. After the OA reboots, reconnect to the OA and log in to confirm successful updates. Run the UPDATE
SHOW FIRMWARE command to display the complex bundle version and the firmware versions installed.
Example:
Configured complex firmware bundle version: 7.6.0
Firmware on all devices matches the complex configured bundle version
NOTE: The bundle contains firmware for the complex and nPartitions. The bundle does not contain I/O
card drivers or firmware.
9. Verify that all partitions are ready for use with the parstatus -P command.
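Taken together, steps 4 through 9 form a short OA CLI session. The sketch below strings the commands named in this procedure together for illustration only; the blade number is an example, and <uri> remains a placeholder for the path to the downloaded bundle.

sd-oa1> connect blade 1              (repeat for each installed blade)
</>hpiLO-> exit
sd-oa1> show hr                      (then use show indict and show deconfig in the viewer)
sd-oa1> update firmware <uri> all    (can take up to an hour; the OA reboots during the update)
sd-oa1> update show firmware         (after reconnecting and logging in to the OA)
sd-oa1> parstatus -P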
I/O firmware and drivers
It is important that you install the recommended I/O adapter firmware and drivers for the appropriate
complex firmware bundle. For information about supported firmware and drivers for supported I/O cards,
see Firmware Matrix for HPE Integrity Superdome X servers at http://www.hpe.com/info/superdomeX-firmware-matrix. Use the information provided in this document to download the correct firmware bundle
and drivers.
IMPORTANT: Installing incorrect or unsupported firmware can cause unpredictable behavior. The
latest I/O device firmware versions might not be supported for your system. Be sure to use only the
firmware versions that are qualified and recommended for your system. Do not use the SPP as a
source of device firmware for Superdome X systems.
SMH and WBEM providers
Hewlett Packard Enterprise recommends that you install the latest versions of the SMH and WBEM
providers for your OS.
NOTE: You must install the SMH package before the WBEM providers or in the same session.
Use the information provided in the Firmware Matrix for HPE Integrity Superdome X servers
document to download the correct WBEM providers.
A reboot is not required for SMH and WBEM provider changes to take effect.
Drivers and firmware for other devices
Interconnect modules also contain firmware which can be updated.
Before installing any firmware or drivers, be sure to see the Firmware Matrix for HPE Integrity Superdome X servers at http://www.hpe.com/info/superdomeX-firmware-matrix. Use only the specified firmware
and drivers. Use the information provided in this document to download the correct versions. Also see the
Linux and Windows white papers for additional updates that might be needed.
Superdome X operating systems
This is the current OS support information for Superdome X systems.
OSs supported
Integrity Superdome X supports these operating systems:
•Microsoft Windows Server
◦2012 R2 (BL920s, all versions)
◦2016 (BL920s, all versions)
•VMware
◦vSphere 5.5 U2 (BL920s Gen8 up to 8 sockets)
◦vSphere 5.5 U3 (BL920s Gen8 and Gen9 v3 up to 8 sockets)
◦vSphere 6.0 (BL920s Gen8 up to 8 sockets)
◦vSphere 6.0 U1 (BL920s Gen8 up to 16 sockets and Gen9 v3 up to 8 sockets)
◦vSphere 6.0 U2 (BL920s Gen8 up to 16 sockets and Gen9 v3 & v4 up to 8 sockets)
◦vSphere 6.0 U3 (BL920s Gen8 up to 16 sockets and Gen9 v3 & v4 up to 8 sockets)
•Red Hat Linux
◦RHEL 6.5 (BL920s Gen8)
◦RHEL 6.6 (BL920s Gen8 and Gen9 v3)
◦RHEL 6.7 (BL920s, all versions)
◦RHEL 6.8 (BL920s, all versions)
◦RHEL 6.9 (BL920s, all versions)
◦RHEL 7.0 (BL920s Gen8)
◦RHEL 7.1 (BL920s Gen8 and Gen9 v3)
◦RHEL 7.2 (BL920s, all versions)
◦RHEL 7.3 (BL920s, all versions)
•SuSE Linux
◦SLES 11 SP3 (BL920s Gen8 and Gen9 v3)
◦SLES 11 SP3 for SAP (BL920s Gen8 and Gen9 v3)
◦SLES 11 SP4 (BL920s, all versions)
◦SLES 12 (BL920s Gen8 and Gen9 v3)
◦SLES 12 SP1 (BL920s, all versions)
◦SLES 12 SP2 (BL920s, all versions)
Support for some OSs requires a minimum firmware version. For the minimum required firmware
versions, see the Firmware Matrix for HPE Integrity Superdome X servers at http://www.hpe.com/info/superdomeX-firmware-matrix.
For the latest list of supported OSs, see the HPE Integrity Superdome X Operating System Reference at
http://www.hpe.com/info/enterprise/docs (Servers > Integrity Servers > Integrity Superdome X) or
the Firmware Matrix for HPE Integrity Superdome X servers at http://www.hpe.com/info/superdomeX-
firmware-matrix.
Using Microsoft Windows Server
For detailed information about using the Windows OS on Integrity Superdome X systems, see the
Running Microsoft Windows Server on HPE Integrity Superdome X white paper at http://www.hpe.com/
support/superdomeXwindows-whitepaper.
Using VMware
For detailed information about using VMware on Integrity Superdome X systems, see the Running
VMware vSphere on HPE Integrity Superdome X white paper at http://www.hpe.com/support/
superdomeXvmware-whitepaper.
Using Red Hat Linux
For detailed information about using RHEL on Integrity Superdome X systems, see the Running Linux on
HPE Integrity Superdome X white paper at http://www.hpe.com/support/superdomeXlinux-
whitepaper.
Using SuSE Linux
For detailed information about using SLES on Integrity Superdome X systems, see the Running Linux on
HPE Integrity Superdome X white paper at http://www.hpe.com/support/superdomeXlinux-
whitepaper.
Partitioning
This chapter provides information on partition identification and operations.
Partition Identification
Every partition has two identifiers: a partition number (the primary identifier from an internal perspective)
and a partition name (a more meaningful handle for administrators).
Partition Number
•A numeric value that is well suited for programmatic use and required by the hardware for configuring
routing, firewalls, etc. related to nPartitions.
•Once a partition has been created, its partition number cannot be changed. In effect, a different
partition number implies a different partition.
•Only one instance of an nPartition with a given partition number can exist within a complex.
•The range of partition numbers for nPartitions is 1 – 255.
Partition Name
•A partition name is a string value which directly conveys meaning.
•The name of a partition can be changed; this includes after the partition has been created and even if
a partition is active (such is the nature of an alias).
•A partition name must contain at least one of the following non-numeric characters:
◦a-z
◦A-Z
◦- (dash)
◦_ (underscore)
◦. (period)
Any other non-numeric character is not allowed in a partition name.
•nPartition names are unique within a complex.
Partition Power Operations
To activate an inactive nPartition, use the poweron partition command on the OA CLI.
To make an active partition inactive, use the poweroff partition command on the OA CLI.
To reboot an active nPartition, use the reboot partition command on the OA CLI.
To do a TOC on the nPartition and obtain a core dump, use the toc partition command from the OA
CLI.
To list all the nPartitions and their boot states and runstates (active or inactive states), use the
parstatus -P command on the OA CLI.
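A minimal OA CLI session illustrating these operations might look like the following. The partition number is an example, output is omitted, and the full syntax and options are described in "Partition commands" in the OA CLI User Guide.

sd-oa1> poweron partition 1
sd-oa1> parstatus -P             (confirm the boot state and runstate)
sd-oa1> reboot partition 1
sd-oa1> toc partition 1          (initiates a TOC and core dump on the active nPartition)
sd-oa1> poweroff partition 1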
For more information on the usage of these commands, see "Partition commands" in the HPE Integrity Superdome X and Superdome 2 Onboard Administrator Command Line Interface User Guide.
PARSTATUS
The status of a partition and its assigned resources can be obtained by exercising various options
available with the OA CLI command parstatus. For more information on the parstatus command,
see "Partition commands" in the HPE Integrity Superdome X and Superdome 2 Onboard Administrator Command Line Interface User Guide.
UUID for nPartitions
The partition firmware subsystem generates a unique nPar UUID when a user creates an nPartition.
The UUID is communicated to system firmware, which places it in the SMBIOS so that the OS and
management applications can pick it up and use it as the universally unique identifier of the partition.
The UUID is also available to manageability and deployment tools and applications through established
SOAP interfaces that can query the UUID. Customers can view the UUID of the nPartition by issuing
parstatus -p <npar_id> -V and checking the "Partition UUID" field.
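For example, a sketch of querying the UUID for nPartition 1 follows; the partition number is an example, the UUID shown is a placeholder, and the exact output layout may differ from this abbreviated form.

sd-oa1> parstatus -p 1 -V
...
Partition UUID : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
...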
nPartition states
The nPartition state indicates whether the nPartition has booted and represents the power state of the
nPartition. An nPartition has one of the following states:
•Active nPartition
•Inactive nPartition
•Unknown
Active nPartition
An nPartition is active when a poweron operation is initiated on the nPartition and the firmware boot
process is started.
Inactive nPartition
An nPartition is considered inactive when it is not powered on. An nPartition is in inactive state after it has
been created or shut down.
Unknown nPartition
An nPartition might report a partition state of “Unknown” and a runstate of “DETACHED” after an OA
restart. This state is possible when the firmware is not able to identify the correct nPartition state due to
internal firmware errors at OA startup. The state is persistent and can only be cleared by force powering
off the nPartition from the OA. A partition in this state will not accept any partition operation for the
nPartition, except parstatus and force poweroff. Any active OS instances continue to run unhindered
even when the nPartition is in an Unknown state.
If any attempts are made to issue partition administration operations, the following error occurs:
Error: Partition state unavailable due to firmware errors. All OS instances
running in this partition will continue unimpacted.
NOTE: To clear this partition state:
1. Shut down all OS instances in the nPartition.
2. Force power off the nPartition from the OA.
3. Power on the nPartition from the OA.
This is an example of parstatus output for a partition in the DETACHED state:
parstatus -P
[Partition]
Par State/RunState Status* # of # of ILM/(GB)** Partition Name
=== =============== ======= ==== ==== ============= ==============
1 Unknown/DETACHED OK 8 0 0.0/8192.0 nPar0001
* D-Degraded
** Actual allocated for Active and User requested for Inactive partitions
To list all the nPartitions and their boot states and runstates (active or inactive states), use the
parstatus -P command on the OA CLI.
parstatus -P
[Partition]
Par State/RunState Status* # of # of ILM/(GB)** Partition Name
=== =============== ======= ==== ==== ============= ==============
1 Inactive/DOWN OK 4 0 0.0/4096.0 nPar0001
2 Active/EFI OK 4 0 0.0/4096.0 nPar0002
* D-Degraded
** Actual allocated for Active and User requested for Inactive partitions
nPartition runstate
The partition runstates displayed by the status commands show the actual state of the partition varying
from a firmware boot state to a state where an OS has successfully booted in a partition. The following
table lists the runstates for an nPartition.
State          Description
DOWN           The partition is inactive and powered off.
ACTIVATING     A boot operation has been initiated for this partition.
FWBOOT         The boot process is in the firmware boot phase for this partition and the partition has transitioned into the active status.
EFI            The partition is at the EFI shell.
OSBOOT         The boot process has started booting the OS in this partition.
UP             The OS in this partition is booted and running. (1)
SHUT           A shutdown/reboot/reset operation has been initiated on this partition.
DEACTIVATING   The partition is being deactivated (powered down) as part of a shutdown or reboot operation.
RESETTING      A partition reset is in progress.
MCA            A machine check (MCA) has occurred in the partition and is being processed.
DETACHED       The status is not known. This might reflect an error condition or a transitional state while partition states are being discovered.

(1) OS WBEM drivers must be installed to see this runstate.
nPartition and resource health status
The nPartition and resource status reveals the current health of the hardware. The nPartition resources
can have one of the following usage statuses:

Resource Usage   Description
Empty            The slot has no resource.
Inactive         Resource is inactive.
Unintegrated     Firmware is in the process of discovering or integrating the resource. It cannot be used for partition operations.
Active           The resource is active in the partition.

The partition resources might display one of the following health statuses:

Resource health   Meaning                                    Comment
OK                Okay/healthy                               Resource is present and usable.
D                 Deconfigured                               Resource has been deconfigured.
I                 Indicted                                   Resource has been indicted.
PD                Parent Deconfigured                        A parent resource has been deconfigured. An example is the status of a memory DIMM which is healthy when the blade in which it is located is deconfigured. The DIMM status is then PD.
PI                Parent Indicted                            Similar to PD except the parent resource has been indicted.
I D               Indicted and Deconfigured                  A resource has been indicted and deconfigured.
PI PD             Parent Indicted and Parent Deconfigured    A parent resource has been indicted and deconfigured.
The health of an nPartition depends on the health of its own resources. If there are unhealthy resources,
the health of the partition is marked as Degraded. If all the resources in the partition are healthy, the
health of the partition is reported as OK.
Troubleshooting
The purpose of this chapter is to provide a preferred methodology (strategies and procedures) and tools
for troubleshooting complex error and fault conditions.
This section is not intended to be a comprehensive guide to all of the tools that can be used for
troubleshooting the system. See the HPE Integrity Superdome X and Superdome 2 Onboard
Administrator User Guide and the HPE Integrity Superdome X and Superdome 2 Onboard Administrator
Command Line Interface User Guide for additional information on troubleshooting using the OA.
General troubleshooting methodology
The system provides the following sources of information for troubleshooting:
•LED status information
•Insight Display
•OA CLI, Health Repository (HR) and Core Analysis Engine (CAE)
•OA GUI
NOTE:
Examples in this section might reflect other systems and not the currently supported configuration of the
Integrity Superdome X system.
LED status information
The LEDs provide initial status and health information. LED information should be verified by the other
sources of status information.
See LEDs and components on page 57 for more information.
TIP:
The OA CLI is the most efficient way to verify the information provided from LEDs.
OA access
You can access the OA by entering the 169.254.1.x address using either a Telnet session or an SSH
connection. Connect a laptop to the service port on the OA tray using a standard LAN cable, or use a
system that has access to the OA management LAN (the customer LAN connected to the OA RJ45 port).
See Connecting a PC to the OA service port for more information about connecting to the OA
service port.
IMPORTANT: The OA service (Link Up) port is not to be confused with the serial port. The OA serial
port is only used for initial system setup. Once the network is configured, the OA should always be
accessed using a Telnet or SSH connection to the service port.
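As an illustration, a remote session might be opened as follows. The account name and the final address digit are placeholders; use the actual service port address in the 169.254.1.x range and an OA account with appropriate rights, then run the status commands described in the next section.

ssh <OA account>@169.254.1.x     (or: telnet 169.254.1.x)
sd-oa1> show complex status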
OA CLI
The central point of communication for gaining system status is the active OA.
Hewlett Packard Enterprise recommends checking the system status information using show complex
status before continuing with troubleshooting:
sd-oa1> show complex status
Status: OK
Enclosure ID: OK
Enclosure: OK
Robust Store: OK
CAMNET: OK
Product ID: OK
Xfabric: OK
Diagnostic Status:
Thermal Danger OK
Cooling OK
Device Failure OK
Device Degraded OK
Firmware Mismatch OK
If no issues are seen in the command output, then more troubleshooting information is required.
Gathering power related information
Gather the power information for all of the system components.
Compute enclosure
Use the show enclosure status and show enclosure powersupply all commands.
sd-oa1> show enclosure status
Enclosure 1:
Status: OK
Enclosure ID: OK
Unit Identification LED: Off
Diagnostic Status:
Internal Data OK
Thermal Danger OK
Cooling OK
Device Failure OK
Device Degraded OK
Redundancy OK
Indicted OK
Onboard Administrator:
Status: OK
Standby Onboard Administrator:
Status: OK
Power Subsystem:
Status: OK
Power Mode: Not Redundant
Power Capacity: 14400 Watts DC
Power Available: 2270 Watts DC
Present Power: 6024 Watts AC
Cooling Subsystem:
Status: OK
Fans Good/Wanted/Needed: 15/15/15
Fan 1: 10760 RPM (60%)
Fan 2: 10758 RPM (60%)
Fan 3: 10760 RPM (60%)
Fan 4: 10760 RPM (60%)
Fan 5: 10759 RPM (60%)
Fan 6: 8600 RPM (48%)
Fan 7: 8600 RPM (48%)
Fan 8: 8600 RPM (48%)
Fan 9: 8599 RPM (48%)
Fan 10: 8599 RPM (48%)
Fan 11: 8602 RPM (48%)
Fan 12: 8601 RPM (48%)
Fan 13: 8600 RPM (48%)
Fan 14: 8597 RPM (48%)
Fan 15: 8600 RPM (48%)
sd-oa1> show enclosure powersupply all
Power Supply #1 Information:
Status: OK
AC Input Status: OK
Capacity: 2450 Watts
Current Power Output: 918 Watts
Serial Number: 5BGXF0AHL4B0S6
Product Name: HPE 2400W 80 PLUS PLATINUM
Part Number: 588603-B21
Spare Part Number: 588733-001
Product Ver: 07
Diagnostic Status:
Internal Data OK
Device Failure OK
Power Cord OK
Indicted OK
Similar information will be displayed for all other power supplies.
Collecting power status information for components at the compute enclosure
Use the show xfm status all, show blade status all, and show interconnect status
all commands to gather power status information for the compute enclosure components in use:
NOTE: OA displays XFM2 information as SXFM.
NOTE: Similar information should be displayed for XFMs 1 through 3.
sd-oa1> show xfm status all
Bay 4 SXFM Status:
Health: OK
Power: On
Unit Identification LED: Off
Diagnostic Status:
Internal Data OK
Management Processor OK
Thermal Warning OK
Thermal Danger OK
Power OK <<<<
Firmware Mismatch OK
Indicted OK
Link 1: Dormant
Link 2: Dormant
Link 3: Dormant
Link 4: Dormant
sd-oa1> show blade status all
Blade #1 Status:
Power: On
Current Wattage used: 1325 Watts
Health: OK
Unit Identification LED: Off
Diagnostic Status:
Internal Data OK
Management Processor OK
Thermal Warning OK
Thermal Danger OK
I/O Configuration OK
Power OK <<<
Cooling OK
Device Failure OK
Device Degraded OK
Device Info OK
Firmware Mismatch OK
PDHC OK
Indicted OK
sd-oa1> show interconnect status all
Interconnect Module #1 Status:
Status: OK
Thermal: OK
CPU Fault: OK
Health LED: OK
UID: Off
Powered: On
Diagnostic Status:
Internal Data OK
Management Processor OK
Thermal Warning OK
Thermal Danger OK
I/O Configuration OK
Power OK <<<
Device Failure OK
Device Degraded OK
Gathering cooling related information
Use the following commands to gather all complex cooling information:
•show enclosure fan all
sd-oa1> show enclosure fan all
Fan #1 Information:
Status: OK
Speed: 60 percent of Maximum speed
Maximum speed: 18000 RPM
Minimum speed: 10 RPM
Power consumed: 32 Watts
Product Name: Active Cool 200 Fan
Part Number: 412140-B21
Spare Part Number: 413996-001
Version: 2.9
Diagnostic Status:
Internal Data OK
Location OK
Device Failure OK
Device Degraded OK
Missing Device OK
Indicted OK
•show blade status all
sd-oa1> show blade status all
Blade #1 Status:
Power: On
Current Wattage used: 1100 Watts
Health: OK
Unit Identification LED: Off
Virtual Fan: 36%
Diagnostic Status:
Internal Data OK
Management Processor OK
Thermal Warning OK
Thermal Danger OK
I/O Configuration OK
Power OK
Cooling OK
Location OK
Device Failure OK
Device Degraded OK
iLO Network OK
Device Info OK
Firmware Mismatch OK
Mezzanine Card OK
Deconfigured OK
PDHC OK
Indicted OK
•show xfm status all
sd-oa1> show xfm status all
Bay 4 SXFM Status:
Health: OK
Power: On
Unit Identification LED: Off
Diagnostic Status:
Internal Data OK
Management Processor OK
Thermal Warning OK <<<
Thermal Danger OK <<<
Power OK
Firmware Mismatch OK
Indicted OK
Link 1: Dormant
Link 2: Dormant
Link 3: Dormant
Link 4: Dormant
•show interconnect status all
Interconnect Module #1 Status:
Status: OK
Thermal: OK
CPU Fault: OK
Health LED: OK
UID: Off
Powered: On
Diagnostic Status:
Internal Data OK
Management Processor OK
Thermal Warning OK <<<<
Thermal Danger OK <<<<
I/O Configuration OK
Power OK
Device Failure OK
Device Degraded OK
Gathering failure information
To obtain information about failures recorded by the system, use the following commands:
•show cae -L
sd-oa1> show cae -L
Sl.No Severity EventId EventCategory PartitionId
EventTime Summary
###########################################################################
#####
71 Critical 3040 System Coo... N/A Fri May 18 06:26:34
2012 SXFM air intake
or exhaust temperature...
70 Critical 3040 System Coo... N/A Fri May 18 04:56:22
2012 SXFM air intake
or exhaust temperature...
•show cae -E -n <Sl.No>
Use show cae -E -n <Sl.No> to obtain more details about specific events.
oa1> show cae -E -n 70
Alert Number : 70
Event Identification :
Event ID : 3040
Server blade appears non-functional
Provider Name : CPTIndicationProvider
Event Time : Fri May 18 04:56:22 2014
Indication Identifier : 8304020120518045622
Managed Entity :
OA Name : sd-oa1
System Type : 59
System Serial No. : USExxxxxS
OA IP Address : aa.bb.cc.dd
Affected Domain :
Enclosure Name : lc-sd2
RackName : sd2
RackUID : 02SGHxxxxAVY
Impacted Domain : Complex
Complex Name : SD2
Partition ID : Not Applicable
Summary :
XFM air intake or exhaust temperature is too hot
Full Description :
The air temperature measured at one of the XFM air intakes or exhausts is too hot to allow
normal operation. Measures are being taken to increase the cooling ability of the box, and
to reduce heat generation. If the temperature continues to increase, however, partitions
might be shut down to prevent hardware damage.
Probable Cause 1 :
Data center air conditioning is not functioning properly
Recommended Action 1 :
Fix the air conditioning problem
Probable Cause 2 :
The system air intake is blocked
Recommended Action 2 :
Check and unblock air intakes
Replaceable Unit(s) :
Part Manufacturer : HPE
Spare Part No. : AH341-67001
Part Serial No. : MYJaaaaaWV
Part Location : 0x0100ff02ff00ff51 enclosure1/xfm2
Additional Info : Not Applicable
Additional Data :
Severity : Critical
Alert Type : Environmental Alert
Event Category : System Cooling
Event Subcategory : Unknown
Probable Cause : Temperature Unacceptable
Event Threshold : 1
Event Time Window (in minutes): 0
Actual Event Threshold : 1
Actual Event Time Window (in minutes): 0
OEM System Model : NA
Original Product Number : AH337A
Current Product Number : AH337A
OEM Serial Number : NA
Version Info :
Complex FW Version : 7.4.2
Provider Version : 8.34
Error Log Data :
Error Log Bundle : 4000000000000e41
Recommended troubleshooting methodology
The recommended methodology for troubleshooting a complex error or fault is as follows:
Procedure
1. Consult the system console for any messages, emails, or other items pertaining to a server blade error
or fault.
2. Use the SHOW PARTITION CONSOLELOG <nPar ID> command on the Monarch OA to view information about
a particular partition.
3. Check the Insight Display for any error messages.
4. View the front panel LEDs (power and health), locally or remotely by using the OA CLI SHOW STATUS
commands, such as SHOW ENCLOSURE STATUS, SHOW COMPLEX STATUS, or SHOW BLADE
STATUS.
5. Use the Core Analysis Engine and Health Repository to discover faults, indictments, and
deconfigurations.
Use the SHOW CAE -L and SHOW CAE -E -n #### commands, and the SHOW HR command (followed by SHOW INDICT and SHOW DECONFIG) from the Health Repository.
6. Perform the actions specified in the Action column.
7. If more details are required, see the Action column of the relevant table provided in this chapter. The
Action you are directed to perform might be to access and read one or more error logs (the event log
and/or the FPL).
You can follow the recommended troubleshooting methodology and use Basic troubleshooting and
Advanced troubleshooting, or go directly to the subsection of this chapter which corresponds with your
chosen entry point. The Troubleshooting entry points table below provides the corresponding subsection
or location title for the various entry points (for example, to start by examining the logs, go directly to
Using event logs on page 75).

Table 13: Troubleshooting entry points

Entry point                      Subsection or location
Front panel LEDs                 See Troubleshooting tables on page 52, Troubleshooting tools on page 57, and LEDs and components.
Insight Display                  See Insight Display on page 114.
Log viewers                      See Using event logs on page 75.
Offline and Online Diagnostics   See Troubleshooting tools on page 57.
Analyze events                   For information about using HPE Insight Remote Support to analyze system events, see http://www.hpe.com/info/insightremotesupport.

Developer log collection
The OA will automatically save a set of debug logs when it notices daemon failures on the PDHC or OA.
Retrieving existing developer logs
Existing developer logs can be copied to a USB thumb drive or FTP site.
Procedure
1. Set up an FTP server or insert a USB thumb drive into the enclosure DVD module USB port.
2. SHOW USBKEY
3. SHOW ARCHIVE
NOTE: Archives beginning with CH- are the automatically collected logs.
•For USB — enter COPY archive://CH-<archive name> USB <USB path>
•For FTP — enter COPY archive://CH-<archive name> FTP://<ftp path>
NOTE: The COPY command also supports additional protocols: TFTP, HTTP, HTTPS, SCP, and SFTP.
For more information about the COPY command, see the HPE Integrity Superdome X and Superdome2 Onboard Administrator Command Line Interface User Guide.
4. CLEAR ARCHIVE
USB example:
zany-oa> SHOW ARCHIVE
Debug Logs Time
_______________________________________________ ____________________
archive://CH-zany-oa-20140529_1555–logs.tar.gz May 29, 2014 15:55
USB/dec/CH-zany-oa-20140529_1555–logs.tar.gz
The file archive://CH-zany-oa-20140529_1555–logs.tar.gz was successfully copied
to usb://d2/dec/CH-zany-oa-20140529_1555–logs.tar.gz.
Generating a debug archive
Use this procedure to generate a new debug archive, and then copy it to a USB thumb drive or FTP site.
1. UPLOAD DEBUG ARCHIVE <customer name>
2. Set up an FTP server or insert a USB thumb drive into the enclosure DVD module USB port.
3. SHOW USBKEY
4. SHOW ARCHIVE
•for USB — enter COPY archive://<archive name> USB <USB path>
•for FTP — enter COPY archive://<archive name> FTP://<ftp path>
5. CLEAR ARCHIVE
FTP example:
zomok-oa> UPLOAD DEBUG ARCHIVE dec
zomok-oa> SHOW ARCHIVE
Debug Logs Time
________________________________________________ ____________________
archive://dec/zomok-oa-20140529_1513–logs.tar.gz May 29, 2014 15:13
archive://CH-zomok-oa-20140527_1605–logs.tar.gz May 27, 2014 16:05
archive://CH-zomok-oa-20140525_0534–logs.tar.gz May 25, 2014 05:34
zomok-oa> COPY archive://dec/zomok-oa-20140529_1513–logs.tar.gz
ftp://user:pass@16.114.160.113/zomok-oa-20140529_1513–logs.tar.gz
The file archive://dec/zomok-oa-20140529_1513–logs.tar.gz was successfully copied to
ftp://16.114.160.113/zomok-oa-20140529_1513–logs.tar.gz.
Troubleshooting tables
Use these troubleshooting tables to determine the symptoms or condition of a suspect server blade. Be
aware that the state of the front panel LEDs can be viewed locally or remotely using the SHOW BLADE
STATUS command from the OA CLI.
Table 14: Basic troubleshooting
Step 1
Condition: Server blade appears non-functional; no front panel LEDs are on and no fans are running. OA CLI is running.
Action:
Nothing is logged for this condition.
1. For new blade installations, review the installation procedures.
2. Check the CAE to see if any issues have been reported.
3. Re-seat the server blade. It may take more than a minute for the blade to fully power on.
4. As the last option, replace the server blade.
The issue is fixed when the front panel power icon is in one of the following states:
•Flashing amber = Powered on, not active
•Green = Powered on and active
and the front panel Health icon LED is in one of the following states:
•Off = Server blade not active; health is good.
•Green = Server blade active; health is good.

Step 2a
Condition: OA is not running; Health LED is OFF and power icon is ON or flashing (only one OA is installed).
NOTE: A single OA is not a supported configuration.
Action:
NOTE: You cannot access the OA at this time.
1. Verify that at least one upper and one lower power supply has the following normal LED status:
•The power supply power LED is on.
•The power supply fault LED is off.
2. If the OA tray has a single OA installed, reseat the OA and the OA tray.
3. If two OAs are installed, locate the OA with the Active LED illuminated and either reset the active (not responding) OA, or log in to the standby OA CLI and issue the FORCETAKEOVER command.
4. If the second (non-suspect) OA operates properly, then replace the suspect OA.
The issue is fixed when OA CLI logs can be read and the front panel OA Health LED is green.
Step 2b
Condition: Blade Health LED is flashing amber and OA CLI is running.
Action:
A warning or critical failure has been detected and logged while booting or running system firmware. Examine the OA CLI logs for events and perform the corrective actions indicated.
The issue is fixed when the front panel Health icon LED is in one of the following states:
•Off = Server blade not active; health is good.
•Green = Server blade active; health is good.

Step 3a
Condition: Cannot see UEFI prompt on system console. UEFI is running.
Action:
Nothing can be logged for this condition.
1. If the blade was able to join the partition but didn't reach the UEFI prompt, then the issue might be I/O related. Check the CAE for any issues with PCIe card drivers.
2. If the blade was not able to join the partition, then open the Health Repository from the OA CLI using show hr followed by the show indict and show deconf commands to check for entries related to processors, processor power modules, shared memory, and core I/O devices.
3. If this is a console issue and no other hardware problems are indicated, replace the Monarch blade.
The issue is fixed when the UEFI menu appears on the system console.

Step 3b
Condition: Cannot find a boot disk. UEFI is running.
Action:
Nothing might be logged for this condition.
1. Search for the boot disk path using the UEFI shell commands (reconnect -r and map -r).
2. Check the I/O card driver settings in the UEFI Device Manager Menu.
3. Examine the OA CLI logs for entries related to processors, processor power modules, shared memory, and core I/O devices. See Using event logs on page 75.
4. Review the OA SHOW ALL section for the SHOW SERVER PORT MAP {bay} to verify that the SAN port is connected. Then check the SAN switch for failures and verify the correct configuration.
Step 3c
Condition: PXE fails to find the boot file on the network. UEFI is running.
Action:
Nothing can be logged for this condition.
1. Verify that the network interface is connected (ifconfig -l). Verify that the Media State: is Media present.
2. If the network interface is connected, configure an IP address using DHCP (ifconfig -s eth0 dhcp), check the network interface again (ifconfig -l), and ping the PXE server (ping <PXE IP>).
If you are able to ping the PXE server, then the PXE boot failure is probably a software issue and not related to the system hardware.

Step 4
Condition: Cannot see OS prompt on system console. OA CLI is running.
Action:
Nothing can be logged for this condition.
Examine the OA CLI logs for entries related to OA modules, processors, processor power modules, shared memory, and core I/O devices. See Using event logs on page 75. IRC or KVM can also be used.
The issue is fixed when the OS prompt appears on the system console.
Table 15: Advanced troubleshooting
Step 5
Symptom/condition: Cannot read SEL.
Action:
SEL logging has stopped (health is steady green and power is steady green).
1. Examine console messages for any UEFI errors or warnings about operation or communications.
2. Ensure that the Robust Store is functioning properly. Try to read the FPL. If all fans are green and reported as OK in response to an OA CLI SHOW ENCLOSURE FAN ALL command, then as a test, re-seat a single fan and verify that this has generated an FPL and SEL entry.
The issue is fixed when the SEL resumes logging.

Step 6
Symptom/condition: OS is nonresponsive after boot. Front panel LEDs indicate that the server blade’s power is turned on, and it is either booting or running the OS (for example, health is steady green and power is steady green).
Action:
Nothing can be logged for this condition.
1. Examine the OA CLI logs for entries related to processors, processor power modules, shared memory, and core I/O devices. Make sure there are no indictments or any hardware issue or known firmware issue. See Using event logs on page 75.
2. Use the OA CLI TC command to initiate a TOC to reset the partition.
3. Reboot the OS and escalate.
4. Obtain the system software status dump for root cause analysis.
The issue is fixed when the OS becomes responsive and the root cause is determined and corrected.

Step 7a
Symptom/condition: MCA occurs during partition operation; the server blade reboots the OS (the partition reboots the OS if enabled). Front panel LEDs indicate that the server blade detected a fatal error that it cannot recover from through OS recovery routines (for example, health is flashing red and power is steady green).
Action:
1. Capture the MCA dump with the OA CLI command show errdump all or show errdump dir mca, and then show errdump bundle_ID <id> for the bundle of interest.
2. Examine the OA CLI logs for entries related to processors, processor power modules, shared memory, and core I/O devices. See Using event logs on page 75 for more details.
The issue is fixed when the root cause is determined and corrected.

Step 7b
Symptom/condition: MCA occurs during partition operation; server blade reboot of the OS is prevented. Front panel LEDs indicate that the server blade detected a Critical (catastrophic or viral) bus error. System firmware is running to gather and log all error data for this MCA event.
NOTE: The troubleshooting actions for this step are identical to those in Step 7a, except that the server blade in this step must be powered off, reseated and/or powered back on, then rebooted. (The server blade reboots the OS automatically if enabled.)
Action:
1. Capture the MCA dump with the OA CLI command show errdump all or show errdump dir mca, and then show errdump bundle_ID <id> for the bundle of interest.
2. Examine the OA CLI logs for entries related to processors, processor power modules, shared memory, and core I/O devices. See Using event logs on page 75 for more details.
The issue is fixed when the root cause is determined and corrected.

Step 8
Symptom/condition: The OA CLI and GUI display this message:
Data stored in the OA and DVD module do not match that in the enclosure. The complex is unusable. To recover, fix this problem and reboot the OA.
Action: Consult the Hewlett Packard Enterprise Support Center to troubleshoot and fix this Rstore failure.
Troubleshooting tools
Cause
Server blades use LEDs and other tools to help troubleshoot issues that occur in the server blade.
LEDs and components
Server blade front panel components
Front panel icons are not visible unless the blade is powered on and the LEDs are lit.
In the following table, the Power and Health icons refer to an Active state. A blade is considered Active
when the partition containing this blade is booting or booted.
1. Power icon: Indicates if the server blade is powered on and active.
   Green = Powered on; active
   Flashing amber = Powered on; not active
   Off = No power supplied to the server blade
2. UID icon: Blue = UID on
3. NIC icon 1: Indicates the status of the NIC.
   Solid green = Network linked; no activity
   Flashing green = Network linked; activity
4. NIC icon 2: Indicates the status of the NIC.
   Solid green = Network linked; no activity
   Flashing green = Network linked; activity
5. NIC icon 3: Indicates the status of the NIC.
   Solid green = Network linked; no activity
   Flashing green = Network linked; activity
6. NIC icon 4: Indicates the status of the NIC.
   Solid green = Network linked; no activity
   Flashing green = Network linked; activity
7. Health icon:
   Off = Server blade not active; health good
   Green = Server blade active; health good
   Flashing amber = Degraded
   Flashing red = Critical error
Power supply LEDs
NOTE: The power supplies at the top of the enclosure are upside down.
Power LED 1 (green)   Fault LED 2 (amber)   Condition
Off                   Off                   No AC power to the power supply
On                    Off                   Normal
Off                   On                    Power supply failure
Fan LED

LED color        Fan status
Solid green      The fan is working.
Solid amber      The fan has failed.
Flashing amber   See the Insight Display screen.
XFM LEDs and components
1. UID LED: Blue = UID on
2. Power LED: Indicates if the module is powered on. Green = On
3. XFM crossbar fabric port 1
4. Link Cable Status LED 1: N/A for Integrity Superdome X
5. XFM crossbar fabric port 2
6. Link Cable Status LED 2: N/A for Integrity Superdome X
7. XFM crossbar fabric port 3
8. Link Cable Status LED 3: N/A for Integrity Superdome X
9. XFM crossbar fabric port 4
10. Link Cable Status LED 4: N/A for Integrity Superdome X
11. XFM crossbar fabric port 5
12. Link Cable Status LED 5: N/A for Integrity Superdome X
13. XFM crossbar fabric port 6
14. Link Cable Status LED 6: N/A for Integrity Superdome X
15. XFM crossbar fabric port 7
16. Link Cable Status LED 7: N/A for Integrity Superdome X
17. XFM crossbar fabric port 8
18. Link Cable Status LED 8: N/A for Integrity Superdome X
19. Health LED:
    Flashing yellow = Degraded; indicted
    Off = The power is not turned on
    Green = OK
    Flashing red = Deconfigured
XFM2 LEDs and components
1. UID LED: Blue = UID on
2. Power LED: Indicates if the module is powered on. Green = On
3. XFM crossbar fabric port 1
4. Link Cable Status LED 1: N/A for Integrity Superdome X
5. XFM crossbar fabric port 2
6. Link Cable Status LED 2: N/A for Integrity Superdome X
7. XFM crossbar fabric port 3
8. Link Cable Status LED 3: N/A for Integrity Superdome X
9. XFM crossbar fabric port 4
10. Link Cable Status LED 4: N/A for Integrity Superdome X
11. Health LED:
    Flashing yellow = Degraded; indicted
    Off = The power is not turned on
    Green = OK
    Flashing red = Deconfigured
GPSM LEDs and components
1. Door display power connector: Unused for Integrity Superdome X systems
2. UID LED: Blue = UID on
3. Health LED:
   Flashing yellow = Degraded; indicted
   Off = The power is not turned on
   Green = OK
   Flashing red = Deconfigured
4. CAMNet port 1: N/A for Integrity Superdome X
5. CAMNet port 2: N/A for Integrity Superdome X
6. CAMNet port 3: N/A for Integrity Superdome X
7. CAMNet port 4: N/A for Integrity Superdome X
8. CAMNet port 5: N/A for Integrity Superdome X
9. CAMNet port 6: N/A for Integrity Superdome X
10. CAMNet port 7: N/A for Integrity Superdome X
11. CAMNet port 8: N/A for Integrity Superdome X
12. Local Clock Distribution LED: Indicates the status of the global clock signal distributed to blades in the compute enclosure.
    Green = OK
    Flashing yellow = Critical error
13. External Clock Input LED: Indicates the status of the global clock signal distributed to connected enclosures.
    Flashing green = No clock signal expected
14. Global clock connector 3: Unused for this release of the system.
15. Global clock connector 2: Unused for this release of the system.
16. Global clock connector 1: Unused for this release of the system.
17. Enclosure DVD module USB port
    NOTE: To ensure proper system functionality, you must connect the USB cable between the OA module and the GPSM.

OA module LEDs and components
1. Reset button: For the different uses of this button, see the HPE Integrity Superdome X and Superdome 2 Onboard Administrator User Guide.
2. OA management LAN port: Standard CAT5e (RJ-45) Ethernet port (100/1000 Mb) which provides access to the management subsystem. Access to the OA's CLI and GUI interfaces, interconnect modules, and iLO features, such as Virtual Media, requires connection to this port.
3. UID LED: Blue = UID on
4. Active OA LED: Indicates which OA is active
5. Health LED:
   Green = OK
   Red = Critical error
6. USB: USB 2.0 Type A connector used for connecting the enclosure DVD module. Connects to the USB mini-A port on the GPSM.
   NOTE: You must connect the USB cable between the OA module and the GPSM to ensure proper system functionality.
7. Serial debug port: Serial RS232 DB-9 connector with PC standard pinout.
   IMPORTANT: This port is for OA debug use only, and should not be connected during normal system operation.
8. VGA: VGA DB-15 connector with PC standard pinout. To access the KVM menu or OA CLI, connect a VGA monitor or rack KVM monitor for enclosure KVM.
DVD module LEDs and components
1. USB connector
2. DVD tray
3. DVD activity LED
4. Tray open/close button
5. Manual tray release
6. Health LED:
   Green = OK
   Flashing yellow = Critical error
7. UID LED: Blue = UID on
OA GUI
The OA GUI provides partition status and FRU information. For more information on using the OA GUI,
see the HPE Integrity Superdome X and Superdome 2 Onboard Administrator User Guide.
NOTE:
CAE events and errdump information are not available using the GUI. You must use the command line for this information.
Health Repository viewer
The Health Repository User Interface displays the information from the HR database. The HR database
contains current state and history covering both service events and the results of error events analysis.
The following information is available in the HR display:
•Description of each failure event on the system that results in a service request, even after a
component is removed or replaced.
•History of component identities.
Information in the HR database is stored as installation and action records. These records are organized
with component physical location as the key.
Indictment Records
Indictment refers to a record specifying that a component requires service. The component or a
subcomponent might or might not be deconfigured as a result. Each indictment record contains the
following information:
•The time of the error.
•The cause of the error.
•The subcomponent location of the error (when analysis allows).
In cases when the failing component cannot be identified with certainty, analysis indicts the most probable
component that will need to be replaced to solve the problem. Other components that might have been
responsible can be identified as suspects by writing a suspicion record. A suspicion record contains the
same fields as an indictment record.
Deconfiguration is the act of disabling a component in the system. This happens when analysis finds that a component has a serious fault. A component's deconfiguration status is composed of the following parts:
•requested state—What the user or Analysis Engine would like to have the component set to.
•current state—How the component is actually configured in the system.
IMPORTANT:
Deconfiguration requests for components in active nPars cannot be acted on until the nPar
experiences a power-off/power-on cycle.
Acquitting indictments
Acquitting refers to clearing the component indictment and deconfiguration statuses, and is done when
the part is serviced. Acquittals happen automatically in the following situations:
•Component insertion—HR will assume that a component inserted into the system has received any
required service. This applies to any components contained within the inserted unit as well. For
example, DIMMs and CPU sockets on an inserted blade will be acquitted. Any deconfigurations will be
reversed.
•AC power cycle or CLI poweron xfabric command — HR will assume that the required service has
been accomplished for the entire complex. All FRUs and sub-FRUs will be acquitted and reconfigured.
•Cohort acquittal—When analysis of a single fault event results in indictment or suspicion records
against multiple components, the records are linked together. If one is acquitted, the acquittal will be
passed to the cohort FRUs as well.
•HR test commands—The test camnet and test clocks commands will acquit all indictments
specific to the test to be executed. Resources that fail the test will be re-indicted as the test completes.
The test fabric command acquits each type (fabric, CAMNet, Global Clock) of indictment before
initiating the test.
NOTE: Indictments indicating faults in subcomponents not targeted by the tests will not be acquitted.
For example, a blade indictment for CPU fault will not be acquitted by any of these test commands.
•Manual Acquittal—The HR UI includes an acquit command that uses the component physical location or resource path as a parameter (see the example following this list). Like other acquittals, the acquittal will act on all indictments for that component.
•Component resumes normal function.
In most cases resumption of function will not cause automatic acquittal. Component replacement,
complex AC power cycle or manual acquittal is required. Examples are as follows:
◦BPS indicted for loss of AC input regains power input.
◦Environmental temperature returns to within acceptable bounds.
◦Enclosure regains sufficient power.
◦Enclosure regains sufficient cooling.
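As a minimal illustration of a manual acquittal from the HR viewer (the prompt and component location shown here are examples only, not taken from a real system; output is omitted):

myhost HR> acquit blade-1/3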
Viewing the list of indicted components
The show indict command will list the currently indicted components for the complex describing the
type, physical location, indication of the cause for indictment, and timestamp.
The show deconfig command will list all components in the complex which are deconfigured or have a
pending request to be deconfigured. The output includes the type, physical location, indication of the
cause for indictment, and timestamp.
myhost HR> show deconfig
System Deconfiguration List - Fri Jun 26 16:54:36 2015
To see details about a specific FRU, use 'show <loc>|<path>'
To see additional deconfiguration details, use 'show deconfig alldata'
Items listed as "Configured" may have deconfigured sub components
myhost HR>
NOTE: The requested and current deconfiguration states shown in the examples above are not the same. This can happen because requested deconfiguration changes are not acted on until the n-Par containing the component in question is rebooted.
DIMMs might be deconfigured without being indicted or even suspected. Some faults isolated to CPU sockets or blades might require deconfiguration of all or part of the memory subsystem by physically deconfiguring the DIMMs supported by that resource. Only indicted components should be replaced. Additional DIMMs that are deconfigured without being indicted are not faulty components and should not be replaced.
Viewing indictment acquittals
The show acquit command will list all components in the complex which have had indictments
acquitted. The output includes the type, physical location, indication of the cause for indictment, and
timestamp.
myhost HR> show acquit
System Acquittal History - Mon May 18 16:11:28 2014
FRU Type: CPU Socket
Location: 0x0100FF01FF00FF11 enclosure1/blade1/socket0
Timestamp: Mon May 18 16:11:19 2009
Indictment State: Acquitted
Requested Deconfig State: Configured
Current Deconfig State: Deconfigured
--- end report --- 2 records shown
NOTE: The requested and current deconfiguration states shown in the examples above are not the same. This can happen because requested deconfiguration changes are not acted on until the n-Par containing the component in question is rebooted.
Viewing recent service history
You can view the recent service history using the show acquit command. To view the installation
history for the acquitted locations, enter show <physical location>|<resource path>.
Physical Location installation and health history
The show <physical location>|<resource path> command returns the entire stored installation and health history of a physical location. This includes up to two previous components installed at this location. The history includes previous indictments, with or without acquittals, rather than just the current indictments.
NOTE:
The following example illustrates BL920s Gen8 blades. The history display for BL920s Gen9 blades is
equivalent but will include different hardware.
2014-03-17 14:12 hpsl18-4 HR> show 0x0100FF0100060A74
Location Installation/Health History - Mon Mar 17 14:12:52 2014
Timestamp: Mon Mar 17 07:42:28 2014
Indictment State: Indicted
Requested Deconfig State: Deconfigured
Current Deconfig State: Deconfigured
dimm-1/1/0/6 Location: 6A
Status: OK No Errors Logged.
--- Install History 1 ---
Discovery: Indictment
Timestamp: Mon Mar 17 04:42:18 2014
(Detailed info about the FRU is provided here if it exists. E.g., for CPUs,
max freq will be provided here. If no data, the section is omitted.)
Serial Num: 1X123456
Parent Serial: MYJ245041R
Part Num: XXX12AB3CDE4A-F5
Spare Part Num: XXX12AB3CDE4A-F5
Manufacturer ID: XX (manufacturer_name)
Product Name: DDR3 DIMM
DIMM size: 8192 MB
HPE DIMM: None
--- Action - Deconfigure ---
Event No: 7004
Provider: MemoryIndicationProvider
(Text reason and description of problem from WS-Man alert.)
Reason: Memory Uncorrectable Error.
Description: Memory Uncorrectable Error - An uncorrectable memory
error has
occurred most likely in the server's memory DIMMs, or the blade.
Bundle ID: 0x011000000000AF3D
Alert ID: 2700420140317074056
Serial Num: 1X123456
Product Name: DDR3 DIMM
- Indicted / Acquitted -
Type Timestamp Entity Reason
Ind Mon Mar 17 07:40:52 2014 CAE See reason above.
---
(SubFRUs requiring service are shown here. If none, the section is omitted.)
- SubFru Isolation -
Entire FRU indicted.
---
(Deconfigured SubFRUs are shown here. If none, the section is omitted.)
- SubFru Deconfiguration -
Entire FRU deconfigured.
---
(Cohorts are shown here. If none, the section is omitted.)
Reason: Memory Uncorrectable Error.
Description: Memory Uncorrectable Error - An uncorrectable memory error has
occurred most likely in the server's memory DIMMs, or the blade.
Bundle ID: 0x011000000000AF3A
Alert ID: 2700420140317044214
Serial Num: 1X123456
Product Name: DDR3 DIMM
- Indicted / Acquitted -
Type Timestamp Entity Reason
Ind Mon Mar 17 04:42:10 2014 CAE See reason above.
Acq Mon Mar 17 07:02:28 2014 User User request.
Subcomponent isolation and deconfiguration displays
Subcomponent isolation refers to the subcomponents of a part that can require service. In these cases,
the component is indicted because the only way the subcomponent can be serviced is by removing and
servicing the entire component.
Subcomponent deconfigurations are also possible. These are indications of subcomponent failures.
The show <location> and show fru command output might contain “SubFru Isolation” and “SubFru
Deconfiguration” sections to communicate subcomponent health information. If a subcomponent
deconfiguration event occurs, the corresponding subcomponent Isolation will also be set, which triggers
an indictment of the parent component.
The sections below show examples of how the subcomponent isolation sections look.
NOTE:
The format of the deconfiguration sections looks identical to that of the Isolation sections, so they are not shown in the following sections.
Blade subcomponent displays
There are several different types of subcomponent displays which can be provided for blades.
DIMMs
The DIMM subFru Isolation display is different from other subFru Isolation displays in that it
communicates DIMM loading order issues rather than faults in the subFRUs. A “1” in the display below
means the DIMM is present but not used due to a loading order issue and “0” means there is no problem
with that DIMM location. This display along with the OA CLI show blade info command output can be
used to determine which DIMMs are present and which are associated with DIMM loading errors.
For Integrity Superdome X, there are FlexLOMs instead of LOMs. Each FlexLOM has its own physical
location. Therefore, indictments against FlexLOMs are issued against the FlexLOM physical location,
rather than indicting the blade and setting one of the LOM bits. The blade SubFru isolation display will
continue to show LOM bits, but these should always have a value of 0.
Components supported by this display are as follows:
•PDHC
•OA_LAN
•USB
•NAND_Flash
•NOR_Flash
•SRAM
•PDH_FPGA
•LPM_FPGA
•RTC
•PDH_SRAM
•iLO
Agent fabric
- SubFru Isolation -
- Blade -
- XNC -
-------------------------------
Entity name: Fault [Only the flagged entity is listed.]
---
Where Entity name is one of the following:
XNC: XNC is flagged
WJ Port n: Entire port is flagged
WJ Port n: Link Upper Half (Upper port flagged)
WJ Port n: Link Lower Half (Lower port flagged)
QPI Link n: Entire link is flagged
QPI Link n Reduced Width: Link is running at some reduced width
Where n for WJ links can range from 0 to 7 and for QPI links can range from 0 to 2.
The SubFRU deconfiguration display section has the same layout as the SubFru Isolation display.
Using event logs
Event logs are generated by software or firmware when an event is detected. Some events that cause
event records to be generated are as follows:
•Hardware-related.
◦Example: DIMM, CPU, VRM, XNC, or PCI-BUS failures.
•Software-related.
◦Example: indicating that firmware or software reached a certain point in the code, or that a certain
amount of time has passed, for example when a QPI LINK has a timeout.
The OA can timestamp and filter events, then store and transfer them to event log readers. Log entries
can be read by management applications in the following:
•OSs
•OAs
•SEL viewers
•FPL viewers
•Live Event viewers
•EAE
Log entries can be cleared by OS management applications or by the OA itself.
Events are classified into a number of severity levels, ranging from critical failure to non-error forward
progress notification. The severity level is encoded in the alert level data field on an event record.
Different system actions might result from generation of an event record, depending on alert level.
Live viewer
The live event viewer provides a way for you to see records as they occur. The OA supports multiple
simultaneous live event viewers that are created and destroyed dynamically when requested. The
maximum number of simultaneous live event viewers is limited by the number of connections supported
by the OA.
Each live event viewer works independently from any other event viewer, meaning that each live event
viewer can select its own filter and format options without affecting other live event viewers.
The log can be filtered using the following items:
•blade number
•partition number
•alert level
The following format options are also available:
•Keyword—This is the default format for all viewers. The keyword format supplies the following
information about an event:
◦log number (not for livelogs)
◦reporting entity type
◦reporting entity ID
◦alert level
◦hexadecimal dump of event records
◦event ID keyword
•Raw hex—The raw hex format supplies the following information about an event:
◦hexadecimal dump of event records
•Text—The text format supplies the following information about an event:
◦log number (not for livelogs)
◦timestamp
◦alert level
◦event ID keyword
◦brief text description
◦reporting entity type
◦reporting entity ID
◦hexadecimal dump of event records
•Problem/Cause/Action—The Problem/Cause/Action format displays a problem/cause/action statement
in addition to the summary and other fields displayed by the text formatter.
To connect to the live log viewer, enter the SHOW LIVELOGS command on the Monarch OA.
NOTE:
The C option can be used to display column header information at any time while in the Live viewer. The column header corresponding to the currently active event viewer format will be displayed.
Welcome to the Live Event Viewer
WARNING: Due to connection speed and/or to the number
of events being generated and/or to the format
option selected, the live event viewer might
silently drop events.
The following event format options are available:
K: Keyword
E: Extended Keyword
R: Raw hex
T: Text
S: Cause/Action
The following alert filter options are available:
Alert filter will cause events at the selected alert filter
and below to be shown
0: Minor Forward Progress
1: Major Forward Progress
2: Informational
3: Warning
5: Critical
7: Fatal
The following event filter options are available:
B: Blade
P: Partition
V: Virtual Partition
U: Unfiltered
Current alert threshold: Alert threshold 0
Current filter option: Unfiltered
Current format option: Extended Keyword
Select new filter/format option, or <ctrl-b> to exit or <cr> to
resume display of live events, or H/? for help or 'C' to
display column header information
Location: Enclosure, Device Bay, Socket, Core, Thread AL: Alert Level
Rep Location nPar: AL Encoded Field Data Field Keyword
Ent vPar Timestamp
Both the SEL and FPL viewers provide a way for OA users to view stored event records. The OA
supports multiple simultaneous viewers. The maximum number of viewers is limited by the number of
connections supported by the OA. Each viewer works independently from any other viewer, meaning
each viewer can select its own filter options without affecting other viewers.
The logs can be filtered using the following items:
•blade number
•cabinet number (not applicable for this release)
•partition number
•alert level
The following format options are also available:
•Keyword—This is the default format for all viewers. The keyword format supplies the following
information about an event:
◦log number
◦reporting entity type
◦reporting entity ID
◦alert level
◦hexadecimal dump of event records
◦event ID keyword
•Raw hex—The raw hex format supplies the following information about an event:
◦hexadecimal dump of event records
•Text—The text format supplies the following information about an event:
◦log number
◦timestamp
◦alert level
◦event ID keyword
◦brief text description
◦reporting entity type
◦reporting entity ID
◦hexadecimal dump of event records
•Problem/Cause/Action—The Problem/Cause/Action format displays the problem/cause/action
statement in addition to the summary and other fields displayed by the text format.
NOTE:
The display of column headers can be turned on or off using toggle option C. By default, the column
header will be on.
To connect to the FPL viewer, enter the SHOW FPL command on the Monarch OA.
Welcome to the Forward Progress Log (FPL) Viewer
The following FPL navigation commands are available:
D: Dump log starting at current block for capture and analysis
F: Display first (oldest) block
L: Display last (newest) block
J: Jump to specified entry and display previous block
+: Display next (forward in time) block
I: Changes between case sensitive and insensitive search
N: Perform previous search using last input string
?/H: Display help
C: Toggle display of column header
<Ctrl-b>: Exit viewer
The following event format options are available:
K: Keyword
E: Extended Keyword
R: Raw hex
T: Text
S: Cause/Action
The following alert threshold options are available:
Alert thresholds will cause events at the selected threshold
and below to be shown
0: Minor Forward Progress
1: Major Forward Progress
2: Informational
3: Warning
5: Critical
7: Fatal
The following event filter options are available:
B: Blade
P: Partition
V: Virtual Partition
U: Unfiltered
Current alert threshold: Alert threshold 0
Current filter option: Unfiltered
Current format option: Extended Keyword
MP:VWR (<cr>,<sp>,+,-,?,H,C,F,I,L,J,D,K,E,R,T,B,P,V,U,/,\,N,0,1,2,3,5,7,<Ctrl-b>) >
Location: Enclosure, Device Bay, Socket, Core, Thread AL: Alert Level
Event# Rep Location nPar: AL Encoded Field Data Field Keyword
Ent vPar Timestamp
To connect to the SEL viewer, enter the SHOW SEL command.
Welcome to the System Event Log (SEL) Viewer
The following SEL navigation commands are available:
D: Dump log starting at current block for capture and analysis
F: Display first (oldest) block
L: Display last (newest) block
J: Jump to specified entry and display previous block
+: Display next (forward in time) block
-: Display previous (backward in time) block
<cr>: Repeat previous +/- command
<sp>: Repeat previous +/- command
/: Search forward for input string
\: Search backwards for input string
I: Changes between case sensitive and insensitive search
N: Perform previous search using last input string
?/H: Display help
C: Toggle display of column header
<Ctrl-b>: Exit viewer
The following event format options are available:
K: Keyword
E: Extended Keyword
R: Raw hex
T: Text
S: Cause/Action
The following alert threshold options are available:
Alert thresholds will cause events at the selected threshold
and below to be shown
2: Informational
3: Warning
5: Critical
7: Fatal
The following event filter options are available:
B: Blade
P: Partition
V: Virtual Partition
U: Unfiltered
Current alert threshold: Alert threshold 2
Current filter option: Unfiltered
Current format option: Extended Keyword
MP:VWR (<cr>,<sp>,+,-,?,H,C,F,I,L,J,D,K,E,R,T,B,P,V,U,/,\,N,2,3,5,7,<Ctrl-b>) >
Location: Enclosure, Device Bay, Socket, Core, Thread AL: Alert Level
Event# Rep Location nPar: AL Encoded Field Data Field Keyword
Ent vPar Timestamp
The CAE is a diagnostic tool that analyzes system errors and generates events that provide detailed
descriptions of severity, probable cause, recommended action, replaceable units, and more. It also
initiates self-healing corrective actions.
Run the SHOW CAE command with the following options:
SHOW CAE {-L <arguments> | -E <arguments> | -C <arguments>}
To see CAE event viewer options, run the following:
OA-CLI> SHOW CAE -h
SHOW CAE : This command can be used to view/clear the indications using the
following options
(-L) [(-e) ([eq:|ne:|le:|ge:](0|1|2|3|4|5|6|7))] |
(-L) [(-e) ([bw:(0|1|2|3|4|5|6|7),](0|1|2|3|4|5|6|7))] : Search
based on severity values:
Minor(4),Major(5),Critical(6),Fatal/NonRecoverable(7)
(-L) [(-i) (<Event ID> [,<Event ID>])] : Search
based on Event Id
(-L) [(-v) (<EventCategory Name>[,<EventCategory Name>] | all)] : Search
based on event category name or view all category names
(-L) [(-p) (<npar[:vpar]>|complex)] : Search
based on partition id or complex
(-L) [(-t) ([eq:|le:|ge:]<mm:dd:yyyy:hh:mi:ss> ] |
(-L) [(-t) ([bw:<mm:dd:yyyy:hh:mi:ss>,]<mm:dd:yyyy:hh:mi:ss>] : Search
based on time of event generation
(-L) [(-r) ([%] <summary> [%])] : Search
based on summary string
(-L) [(-s) [asc:|desc:](id|time|severity|category)] : Sort on
eventid,time,severity or category
(-L) [(-o) <offset>] : Display
from offset <offset>
(-L) [(-c) <count>] : Display
<count> number of events
(-L) [(-f)] : Display
CAE events, filter OS events
(-E) (-n) <Sl.No> : Display
event details with serial number equal to <Sl.No>
(-E) (-a) <alert id> : Display
event details with Indication Identifier/Alert Id equal to
<alert id>
(-C) (-p) (<npar[:vpar]>|complex) : Clear
events based on partition id or complex
(-G) [on|off|alert|device|status] : Enable/
Disable/Enable HPE_AlertIndication/Enable HPE_DeviceIndication/
Display
status for Athena One Stop Fault Management
(-L) [(-b)] : Display
archived events
(-E) [(-b)] (-n) <Sl.No> : Display
archived event details with serial number equal to <Sl.No>
[-h] : Display
usage of this command
To view the list of events generated and analyzed, run the following:
OA-CLI> SHOW CAE -L
Sl.No Severity EventId EventCategory PartitionId EventTime Summary
#####################################################################################################
1 Degraded 12270 Support Fi... 3 Fri Mar 28 15:53:56 2014 SFW test of SMIF over CHIF interface...
(...) indicates truncated text. For complete text see event details
To see the details for each event, run the following:
OA-CLI> SHOW CAE -E -n 1
Alert Number : 1
Event Identification :
Event ID : 12270
Provider Name : FPL_IndicationProvider
Event Time : Fri Mar 28 15:53:56 2014
Indication Identifier : 11227020140328155356
Managed Entity :
OA Name : hawk039oa1
System Type : 59
System Serial No. : SFP1236002
OA IP Address : 15.242.4.234
Affected Domain :
Enclosure Name : hawk039
RackName : hawk039
RackUID : 02SGH5141AE2
Impacted Domain : Partition
Complex Name : hawk039
Partition ID : 3
SystemGUID : 00000000-0000-0000-0000-000000000000
Summary :
SFW test of SMIF over CHIF interface to Gromit iLO fails on the indicated blade.
Full Description :
SFW test of SMIF over CHIF interface to Gromit iLO using SMIF command ILO_STATUS_REQUEST fails, indidicating
the interface is not functional. The logical (nPar) Blade ID is sent as EventData, with 0xFFFF sent if the
blade ID cannot be determined.
Probable Cause 1 :
SMIF over CHIF interface to Gromit iLO fails selftest; resulting SMBIOS records that consume this data are
default values.
Recommended Action 1 :
Reboot the system which attempts to reinitialize the interface.
Probable Cause 2 :
Reboot of the system fails to restore SFW communication to Gromit iLO via the SMIF over CHIF interface
Recommended Action 2 :
Power off the system. Reset the offending Gromit iLO(s) in the system with one of the following:
1) destroy and recreate the partition 2) reset the blade using 'reset blade X' then confirm 'yes'
3) reset iLO and reboot the system.
Replaceable Unit(s) :
Part Manufacturer : Not Applicable
Spare Part No. : Not Applicable
Part Serial No. : Not Applicable
Part Location : Not Applicable
Additional Info : Not Applicable
Additional Data :
Severity : Degraded/Warning
Alert Type : Communications Alert
Event Category : Support Firmware
Event Subcategory : Other
Probable Cause : Communications Protocol Error
Other Event Subcategory : Gromit iLO Configuration Error
Event Threshold : 1
Event Time Window : 0 (minutes)
Actual Event Threshold : 1
Actual Event Time Window : 0 (minutes)
Record ID : 0x0
Record Type : E1
Reporting Entity : 0x0100ff03ff000017 enclosure1/blade3/cpusocket0/cpucore0
Alert Level : 0x3
Data Type : 0x16
Data Payload : 0x1
Extended Reporting Entity ID : 0x2
Reporting Entity ID : 0x1
IPMI Event ID : 0x2b05
OEM System Model : NA
Original Product Number : AH337A
Current Product Number : AT147A
OEM Serial Number : NA
Version Info :
Complex FW Version : 7.6.0
Provider Version : 5.111
Error Log Data :
Error Log Bundle : 400000000001e86c
See the HPE Integrity Superdome X and Superdome 2 Onboard Administrator Command Line Interface
Guide for the correct and detailed command syntax. The HR Viewer can also provide help in visualizing
component issues.
OA
The OA provides diagnostic and configuration capabilities. See the HPE Integrity Superdome X and
Superdome 2 Onboard Administrator Command Line Interface Guide for more information on the OA CLI
commands. You can access the OA CLI through the network.
The status logs consist of the following:
•System Event
•Forward Progress
•Live Events
Remotely accessing the OA
The OA CLI can be accessed remotely through any Telnet or SSH session.
Telnet session
Procedure
1. From a network-connected client, open a command-line window.
2. At the prompt, enter telnet <OA IP address>, and then press Enter. For example, telnet 192.168.100.130.
3. Enter a valid user name, and then press Enter.
4. Enter a valid password, and then press Enter. The CLI command prompt appears.
5. Enter commands for the OA.
6. To end the remote access Telnet session, at the CLI command prompt, enter Exit, Logout, or Quit.
SSH session
Procedure
1. Start an SSH session to the OA.
2. Enter ssh -l <username> <IP-address>.
Example:
ssh -l Administrator 16.113.xx.yy
The authenticity of host '16.113.xx.yy(16.113.xx.yy)' can't be established.
DSA key fingerprint is ab:5e:55:60:2b:71:8f:0c:55:3e:79:3e:a2:93:ea:13
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '16.113.xx.yy' (DSA) to the list of known hosts.
------------This is a private system. Do not attempt to login unless you are an
authorized user. Any authorized or unauthorized access and use may be
monitored and can result in criminal or civil prosecution under applicable
law.
Firmware Bundle Version: 5.73.0
Enclosure Number: 1
OA Number: 1
OA Role: Active
Administrator@16.113.xx.yy's password: <Administrator password>
3. At the CLI command prompt, enter OA commands.
4. To end the remote access SSH session, at the CLI command prompt, close the communication
software or enter Exit, Logout, or Quit.
Locally accessing the OA
If needed for debugging purposes, the OA can be accessed locally through a serial port connector on the
rear of the OA module. Use a laptop or another computer as a serial console to communicate with the
OA.
NOTE: Use of this interface is only for OA debugging purposes and to reset the OA password. This connection cannot be maintained under normal server operations.
Procedure
1. Connect a serial cable between the computer and the serial port on the OA module. See Connecting
a PC to the OA serial port for detailed information on this connection and launching the OA CLI.
2. When prompted, enter a valid user name, and then press Enter.
3. Enter a valid password, and then press Enter. The CLI command prompt appears.
4. Enter commands for the OA.
5. To end the terminal session, enter Exit at the prompt.
NOTE: If the serial console session for a partition is not closed properly, it will impact the speed of the
associated partition console.
Troubleshooting processors
Cause
There are several types of errors concerning the processor environment.
•EFI—typically occur during boot or runtime.
•Boot errors—typically related to a core failing self test, a QPI link not initializing to full speed, or a core
or socket not coming out of reset.
•Runtime errors—can be due to a hardware or software defect that appears in either a core or uncore.
•I/O and XNC errors—consult the CAE error logs. Most common I/O errors are surprise down and
completion timeouts.
•Uncore errors—result in the entire socket indicted and the blade deconfigured, since these errors
affect all cores. If an uncore error is specific to a core, then the core can be deconfigured on the next
boot and the rest of the cores on the socket are unaffected. The most common uncore errors are
errors in the last level cache, firmware errors, or timeouts.
•Core errors—typically first or mid-level cache errors, core-level time-outs, and hardware defects.
•SMI/SMI2 errors
To troubleshoot processor errors, use the OA SHOW CAE -L command. Use the HR SHOW INDICT command to check for indications that a component might be failing.
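For example, a minimal sequence (the prompts are illustrative and output is omitted; show hr opens the Health Repository viewer, as described earlier in this chapter):

OA-CLI> SHOW CAE -L
OA-CLI> show hr
myhost HR> show indict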
Troubleshooting memory
Memory errors can be separated into two categories depending on where they originate:
•CPU to memory buffer errors
•Memory buffer to DIMM errors
Solution 1
Cause
CPU to memory buffer errors
The link between the CPU and the memory buffer is the SMI2 or VMSE link. An SMI2 failure can manifest
as reduced memory size, reduced memory throughput, or machine checks. However, other issues can
result in the same symptoms. CAE will analyze the failure to determine whether SMI2 is at fault.
For errors related to SMI2, suspect the CPU, the memory buffer, or the traces between them. The
memory buffer is permanently attached to the blade, so it cannot be indicted independently. Therefore,
the CPU and/or blade are indicted for an SMI2 error.
If an error occurs on SMI2, replacing DIMMs is unlikely to correct the problem. DIMMs reside on a
separate DDR bus and changes to the DDR bus will not affect the SMI2 bus.
IMPORTANT: Do not move or replace DIMMs for an SMI2_TRAINING_FAILURE event.
Solution 2
Cause
Memory buffer to DIMM errors
The channel between the memory buffer and the DIMM is the DDR channel. Because up to three DIMMs
reside on the same DDR channel and two DDR channels might be configured in lockstep (RAS mode
enabled), up to six DIMMs can be affected by a single faulty DIMM. It is important to distinguish faulty or
suspect DIMMs from healthy DIMMs that happen to reside on the same bus.
On a new installation, DDR training failures can result from DIMMs being partially unseated during
shipping. A common symptom of a partially unseated DIMM is a MEM_DIMM_NO_VALID_DELAY event.
If the machine is still in the installation phase and has not been released to the customer, before replacing
a DIMM, try removing and reinstalling all the DIMMs on that DDR channel. A DIMM that has been in use
for some time is unlikely to be spontaneously unseated.
If a DIMM suffers a correctable or uncorrectable error at runtime and must be replaced, a DIMM pair
might be identified and indicted. A DIMM pair will be two DIMMs on the same memory buffer with the
same loading letter, such as 19A and 24A. In this case, replace both DIMMs in the pair.
CAE generates error events for faulty or suspect DIMMs as indicted. Replace these DIMMs.
Health Repository, the EFI info mem command, and IPMI events might also identify additional
deconfigured DIMMs, sometimes called partner-deconfigured DIMMs, lockstep-disabled DIMMs, or
sibling-disabled DIMMs. These DIMMs are healthy and should not be replaced.
To identify a possible faulty DIMM, use the HR SHOW INDICT command. Replace DIMMs that are
indicted. Do not replace DIMMs that are deconfigured unless there are other indications of a faulty DIMM,
such as being identified with DIMMERR.
Solution 3
Cause
Using DIMMERR
If there are memory errors that do not clearly indicate which hardware is at fault, the HR dimmerr
command can be used to look for patterns of memory failures.
You can use DIMMERR as follows:
•To corroborate other errors that correspond to a specific DIMM or blade.
•To indicate memory training faults.
•To look for DIMM errors in newly installed or replaced DIMMs.
•To look for DIMM errors during partition boot as part of a system installation.
IMPORTANT: DIMMERR will show memory events that were correctable. It is important to note
that correctable errors are expected on large memory systems and all systems will show several
correctable errors over time. Correctable errors only result in indictment after reaching a certain
threshold.
Do not replace DIMMs for normal correctable errors.
From the Health Repository viewer, enter dimmerr <location>, where <location> is the DIMM slot or
a blade.
Example: dimmerr blade-1/1 returns information about all DIMMs for a server blade in slot 1 of
cabinet 1.
DIMM INFO for Cabinet: 1 Board Slot: 1
dimm-1/1/0/1 Location: 1A
Status: OK No Errors Logged.
dimm-1/1/0/2 Location: 2C
Status: OK No Errors Logged.
dimm-1/1/0/3 Location: 3B
Row Bank Col Type Errors First Detected Last Detected
0 256 0 0 1 Fri Feb 11 18:10:51 2011 Fri Feb 11 18:10:51 2011
dimm-1/1/0/4 Location: 4D
Status: OK No Errors Logged.
dimm-1/1/0/5 Location: 5D
Status: OK No Errors Logged.
Troubleshooting cards and drivers
Cause
If driver issues are suspected, use the UEFI driver bypass option to bypass loading the suspected driver. Driver issues can occur, for example, when a card with an old driver is transferred from another system into a new system and connecting the driver results in a failure to boot.
The UEFI driver loading bypass option only appears and is effective during system firmware boot. It does
not appear if the UEFI Front Page is re-entered later.
Normally, system firmware will proceed with automatic boot entry execution (default is seven seconds). To
configure UEFI driver loading bypass, you must press P before the countdown completes to access the
UEFI Driver Loading Bypass Configuration menu.
After pressing the key, a submenu will appear. Select the desired bypass option by pressing a key as the
following indicates:
UEFI Driver Loading Bypass Configuration
Press: 1 — Bypass loading UEFI drivers from I/O slots
2 — Bypass loading UEFI drivers from I/O slots and blade LOMs
N / n - Normal loading of UEFI drivers
Q / q - Quit
Waiting for user input.
The Bypass loading UEFI drivers from I/O slots and blade LOMs option might be useful when a bad
FlexLOM and/or mezzanine card UEFI driver is preventing partition boot. USB drivers can still be used at
the UEFI Shell to help with FlexLOM update.
NOTE: There is no quick reset ability to save time when you are running the bypass option several times
in a row.
After selecting an option, control returns to the UEFI Front Page.
You can then proceed with I/O firmware update (SUM from DVD/Virtual Media .iso).
Troubleshooting compute enclosure events
Cause
Loss of enclosure settings
The OA battery preserves the Integrity Superdome X enclosure settings, such as users and network
settings. When the battery is low, there is a risk of losing these enclosure settings if the OA is removed or
if AC power is interrupted.
When the OA detects a low battery, the battery diagnostic status in SHOW OA STATUS will be marked as
Failed.
sdx-oa> show oa status
Onboard Administrator #1 Status:
Name: sdx-oa
Role: Active
UID: Off
Status: Degraded
Diagnostic Status:
Internal Data OK
Device Failure OK
Missing Device OK
Firmware Mismatch OK
OA Battery Failed
Indicted OK
If the above error occurs, the battery should be replaced. The OA will also log an entry in syslog advising that the battery be replaced:
The OA battery is low or has failed. Configuration settings may be lost if the OA loses power.
Replace the OA Battery with spare part #708907-001.
Troubleshooting firmware
Cause
There are three different firmware systems.
•System firmware bundle
•IO firmware (PCIe and LOM)
•Interconnect module firmware
All firmware systems can be updated.
The system firmware recipe can be updated using SUM or manually using the OA CLI. There are different bundles for each method.
For instructions to update firmware and drivers, see Manually updating the complex firmware on page
34 and Installing the latest complex firmware using SUM on page 34.
For more information about installing firmware updates, see the detailed instructions provided in the
firmware download bundle. Always follow the update instructions for each firmware release.
Identifying and troubleshooting firmware issues
NOTE: Firmware issues are relatively rare. Look for other issue causes first.
Probable firmware failure areas are:
•Unsupported firmware installation
•Corrupt firmware installation
To troubleshoot firmware issues:
Procedure
1. Be sure that all server blade firmware components are from the same release (use the OA CLI
update show firmware command, or check the Complex Firmware version through the OA GUI).
2. Reinstall complex firmware.
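For example, a quick consistency check of the installed firmware versions from the OA CLI might look like the following (the prompt is illustrative and output is omitted; this is the command referenced in step 1):

OA1> update show firmware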
Verifying and installing the latest firmware version
Hewlett Packard Enterprise recommends that all firmware on all devices be updated to the latest version
after hardware installation is complete. Hewlett Packard Enterprise also encourages you to check back
often for any updates that might have been posted.
The most recent versions of software drivers and firmware are available on the support page.
Procedure
1. Go to http://www.hpe.com/support/hpesc.
2. Enter the product name or browse to the product.
3. Select drivers, software & firmware under the Download options tab.
4. Select the product download type.
5. Select a language and then your OS.
6. Select the appropriate download, and then follow the instructions.
NOTE:
The complex (or management side) firmware can be updated while the partition remains online, and then
the partition (or system side) firmware can be applied to the nPartition.
It is possible that some firmware updates will be released which do not require partition firmware updates.
These firmware bundles can be installed without requiring any nPartition downtime.
See the detailed instructions provided in the firmware download bundle for more information.
System firmware
System firmware bundle includes firmware for complex components including the following:
•Server blade firmware (not including LOMs)
•Partition firmware for each server blade and OA
•OA firmware
•Manageability module firmware, including GPSMs and XFMs
IMPORTANT:
Always use the all option when updating firmware using the OA CLI. For example:
OA1> update firmware usb://d2/BL920sGen<x.x>.xx.xxx-fw.bundle all
OA1> update firmware ftp://user:passwd@Hostname/HPx86/BL920sGen<x-x>.<xx.xxx>fw.bundle all
If the all option is not used, only the complex firmware will be updated, and you will have to update
the partition firmware. This will create additional down time.
NOTE: The update firmware command checks the installed FRUs and will only update FRUs that do
not match the complex firmware version.
FRU replacement firmware update procedures
The following table explains the steps to take, and the overall impact each FRU replacement will have on
system operation:
IMPORTANT: Check for indicts before and after each firmware update.
FRU: Blade (requires an nPar outage)
Process:
1. Power OFF the partition the blade is assigned to. (See the Note following this table.)
2. Remove/Replace the suspect blade following the instructions in the service guide.
3. Use the update firmware <uri> all command, pointing it to the <uri> of a bundle file that matches what is installed on the complex. This command checks the current firmware version of all installed FRUs and will only update FRUs that do not match the complex firmware version.
4. Check for indicts.
5. Power on the partition.

FRU: XFM (requires a complex outage)
Process:
1. Power OFF all partitions.
2. Remove and replace the suspect XFM following the instructions in the service guide.
IMPORTANT: Do not mix XFM and XFM2 crossbar modules in the same system.
3. Use the update firmware <uri> all command, pointing it to the <uri> of a bundle file that matches what is installed on the complex. This command checks the current firmware version of all installed FRUs and will only update FRUs that do not match the complex firmware version.
NOTE: The minimum firmware bundle for XFM2 is v8.2.106.
4. Check for indicts.
5. Power on all partitions.
FRU: OA (no outage required)
Process:
1. Ensure that the suspect OA is the standby OA; use the force takeover command if needed.
2. Remove and replace the suspect OA.
3. Use the update firmware <uri> all command, pointing it to the <uri> of a bundle file that matches what is installed on the complex. This command checks the current firmware version of all installed FRUs and will only update FRUs that do not match the complex firmware version.
4. Check for indicts.

FRU: GPSM (no outage required)
Process:
1. Ensure that you are replacing the indicted GPSM.
2. Disconnect the cables from the GPSM being replaced.
3. Remove and replace the suspect GPSM.
4. Use the update firmware <uri> all command, pointing it to the <uri> of a bundle file that matches what is installed on the complex. This command checks the current firmware version of all installed FRUs and will only update FRUs that do not match the complex firmware version.
5. Check for indicts.
NOTE: You will see indictments related to the loss of redundancy of the CAMNet.
6. Acquit the indictments related to the loss of redundancy of the CAMNet.

NOTE: For blade replacement: If the FRU failed in a way that made it unable to join the partition after the failure, you might not need to shut down the partition at the time of the replacement. The FRU can be replaced and the firmware updated. When the partition is rebooted, the replacement FRU will rejoin the partition.
I/O firmware
Every supported FlexLOM and mezzanine card requires its own UEFI driver, and some also require card-specific ROM firmware.
For a complete list of supported I/O cards and related firmware, see the Firmware Matrix for HPE Integrity Superdome X servers document at http://www.hpe.com/info/superdomeX-firmware-matrix.
The following are minimum required firmware versions for supported I/O cards.
HPE Ethernet 10Gb 2-port 560FLB / 560M Adapter
• Gen8 minimum: Boot: 3.0.24; UEFI: 4.5.19
• Gen9 minimum: Boot: 2.3.45; UEFI: 4.9.10
HPE QMH2672 16Gb 2P FC HBA
• Gen8 minimum: Multiboot: 2.02.47 & 4.0.0.0–1; FW: 7.04.00; BIOS: 3.28; UEFI: 6.21
• Gen9 minimum: Multiboot: 2.02.47 & 4.0.0.0–1; FW: 7.04.00; BIOS: 3.31; UEFI: 6.37
HPE IB FDR 2P 545M InfiniBand Adapter
• FW: 10.10.50.52; UEFI: 14.6.27; Flexboot: 3.4.306
HPE FlexFabric 20Gb 2P 630FLB / 630M Adapter
• MFW: 7.10.72; MBA: 7.10.71; EFI: 7.12.83; UEFI: 7.12.31; iSCSI Boot: 7.10.33; CCM: 7.10.71
HPE FlexFabric 20Gb 2P 650FLB / 650M Adapter
• FW: 10.7.110.34; iSCSI Boot EFI: 10.7.110.15; UEFI: 10.7.110.34; iSCSI BIOS: 107.00a9
HPE FlexFabric 10Gb 2-port 534FLB / 534M Adapter
• Gen8 minimum: Boot: 7.10.37; UEFI: 7.10.54; L2FW: 7.10.31
• Gen9 minimum: Boot: 7.12.83; UEFI: 7.12.31

Interconnect module firmware
The system supports the LAN Pass-Thru Module, the HPE ProCurve 6120XG and 6125XLG blade switches, and the HPE 4X FDR Infiniband Switch.
Symptoms of possible firmware issues include erratic server blade, compute enclosure, or other component operation, or unsuccessful boot to the UEFI boot manager or UEFI shell.
The following are the minimum required firmware versions for supported interconnect modules.
• ProCurve 6125XLG blade switch: 6125-CMW520-R2112
• ProCurve 6120G/XG Ethernet Blade Switch: Z.14.52
• 10 Gb Ethernet Pass-Thru: 1.0.11.0
• Brocade 16Gb SAN switch: 7.3.1a or later
• 4X FDR Infiniband Switch: 3.4.0008
Troubleshooting partitions
Cause
Use the following commands to troubleshoot partitions (a combined example appears after the note below):
• Use the OA parstatus command to determine which resources belong to the failing nPar.
• Use the HR> show indict and show deconfig commands to determine whether any of the resources belonging to the nPar are deconfigured, indicted, or in any other failure state.
  If any issues are reported, use the show CAE command for more information.
• Use the show syslog OA 1 command to check the syslog file for the active OA. For example:
OA-CLI> show syslog oa 1
Mar 28 17:20:59 mgmt: Blade 8 has been allocated 1100 watts but iLO is reporting the blade is powered off.
Mar 28 17:21:24 mgmt: Blade 1 Ambient thermal state is OK.
Mar 28 17:21:24 mgmt: Blade 3 Ambient thermal state is OK.
Mar 28 17:21:24 mgmt: Blade 5 Ambient thermal state is OK.
Mar 28 17:21:44 mgmt: Blade 7 Ambient thermal state is OK.
Mar 28 17:26:31 parcon: Note: Partition Controller has initialized all partition
permissions to the default behavior
Mar 28 17:28:53 parcon: Note: nPartition 2: Power On of nPartition completed
Mar 28 17:29:37 mgmt: Blade 2 Ambient thermal state is OK.
Mar 28 17:29:37 mgmt: Blade 4 Ambient thermal state is OK.
Mar 28 17:29:37 mgmt: Blade 6 Ambient thermal state is OK.
Mar 28 17:29:58 mgmt: Blade 8 Ambient thermal state is OK.
Mar 28 17:33:12 -cli: Administrator logged out of the Onboard Administrator
Mar 28 17:33:14 -cli: Administrator logged out of the Onboard Administrator
Mar 28 17:33:16 -cli: Administrator logged out of the Onboard Administrator
Mar 28 17:33:22 -cli: Administrator logged out of the Onboard Administrator
Mar 28 17:33:24 -cli: Administrator logged out of the Onboard Administrator
NOTE: All partition-related messages in OA syslog contain the string parcon:.
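Taken together, a typical first-pass check might look like the following sequence. The prompts are indicative; run the HR> commands from the HR interface referenced above:

OA-CLI> parstatus
HR> show indict
HR> show deconfig
OA-CLI> show cae -L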
See the HPE Integrity Superdome X and Superdome 2 Onboard Administrator Command Line Interface User Guide for information on uploading and downloading partition specification files and runtime configuration files. These actions are not typically needed, but keeping a valid copy of the configuration available for disaster recovery is recommended.
Troubleshooting the network
Cause
An incorrect setup of the compute enclosure and complex-wide internal network can lead to issues with the following tasks:
• Powering partitions on and off
• Updating firmware
• Gathering status information
Each Monarch iLO and OA in the complex must be set up with a unique IP address. IP addresses are obtained either from a DHCP server or by defining them with EBIPA. Non-Monarch iLO addresses default to link local.
Supported IP address ranges for EBIPA
All IP address ranges are supported for EBIPA except 169.254.x.y and 10.254.x.y, which are reserved for the internal management network. Addresses from the non-restricted ranges can be used for iLOs and OAs as long as they are not duplicated (duplicate addresses cause IP address conflicts). In addition, all of the IP addresses must be within the same subnet, defined by the netmask and IP address, so that all OAs and all iLOs fit into that subnet.
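For example, with a netmask of 255.255.254.0, a hypothetical valid assignment places the OAs at 10.67.52.10 and 10.67.52.11 and the Monarch iLOs at 10.67.52.21 through 10.67.52.28: every address is unique, all of them fall inside the same subnet (10.67.52.0/23), and none fall inside the reserved 169.254.x.y or 10.254.x.y ranges.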
Use the show ebipa and show OA network all commands to check the network settings for iLO
and OA:
SHOW EBIPA
EBIPA Device Server Settings
Bay Enabled EBIPA/Current Netmask Gateway DNS Domain
1A No
1B No
2 Yes Link Local 255.255.254.0 10.67.52.1
10.67.52.165
2A No
2B No
SHOW OA NETWORK ALL
Onboard Administrator #1 Network Information:
Name: OA-1
DHCP: Disabled
IP Address: 10.67.52.bbb
Netmask: 255.255.254.0
Gateway Address: 10.67.52.aaa
Primary DNS: 0.0.0.0
Secondary DNS: 0.0.0.0
MAC Address: 9C:8E:99:29:xy:yx
Link Settings: Auto-Negotiation, 1000 Mbit, Full Duplex
Link Status: Active
Enclosure IP Mode: Disabled
Onboard Administrator #2 Network Information:
Name: OA-2
DHCP: Disabled
IP Address: 10.67.52.ccc
Netmask: 255.255.254.0
Gateway Address: 10.67.52.aaa
Primary DNS: 0.0.0.0
Secondary DNS: 0.0.0.0
MAC Address: 9C:8E:99:29:xy:xy
Link Settings: Auto-Negotiation, 1000 Mbit, Full Duplex
Link Status: Active
Enclosure IP Mode: Disabled
Troubleshooting fabric issues
Cause
The Integrity Superdome X has fabric connections between all the blades installed in the compute
enclosure.
Test fabric
To determine the health status of all crossbar connections, use the HR> test fabric command. This test is most valuable during installation, when all partitions can be taken down at the same time. During normal operation, when some or all partitions cannot be taken down at the same time, use the procedure described in Show complex status below.
IMPORTANT: The HR> test fabric command requires a complex outage. Before running HR> test fabric, all indicted and deconfigured parts must be cleared and all partitions must be powered off.
NOTE: Test fabric includes both test camnet and test clocks.
OA1 HR> test fabric
Begin test 1: System Fabric Components
Acquitting any current fabric and CAMNet indictments, and deconfigurations.
Beginning fabric test
SUCCESS: System Fabric test complete
System Fabric routed successfully.
Begin test 2: Management Network Components
CAMNet test has executed without finding faults
Management connectivity test complete
Begin test 3: Global Clock Components
Clocks test started...
Blade Sys Clk 0 Sys Clk 1
========== ========== ==========
Blade 1/1 OK OK
Blade 1/2 OK OK
Blade 1/3 OK OK
Blade 1/4 OK OK
Blade 1/5 OK OK
Blade 1/6 OK OK
Blade 1/7 OK OK
Blade 1/8 OK OK
GPSM Int Clk Ext Clk
========== ========== ==========
GPSM 1/1 * OK ----
GPSM 1/2 * OK ----
SUCCESS: Clocks test passed.
Clocks test complete.
Success: Fabric, CAMNet, and Global Clock tests completed with no errors
Show complex status
Use this procedure to test for fabric issues when some or all partitions can’t be taken down at the same
time.
Action
1. Run SHOW XFM STATUS all to check the health and power status of the XFM modules.
2. Run SHOW COMPLEX STATUS and check the Xfabric status entry.
3. Run SHOW CAE -L and check for any Xfabric routing issues and fabric link failures.
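For example, the three checks can be run in order from the active OA (the prompt shown is indicative):

OA-CLI> show xfm status all
OA-CLI> show complex status
OA-CLI> show cae -L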
Troubleshooting clock-related issues
Cause
Clocks are provided by the GPSM module and are redundant within a complex. Use the command HR>
test clocks to check for clock-related issues as follows:
NOTE: This command can be run while the partitions are active.
HR> test clocks
Clocks test started...
Blade Sys Clk 0 Sys Clk 1
========== ========== ==========
Blade 1/1 OK OK
Blade 1/2 OK OK
Blade 1/3 OK OK
Blade 1/4 OK OK
Blade 1/5 OK OK
Blade 1/6 OK OK
Blade 1/7 OK OK
Blade 1/8 OK OK
GPSM Int Clk Ext Clk
========== ========== ==========
GPSM 1/1 * OK ----
GPSM 1/2 * OK ----
SUCCESS: Clocks test passed.
Clocks test complete.
Any clock failures are also detected and reported by CAE. To obtain these failures, run show CAE -L, and then use the command show CAE -E -n <ID> to obtain more details for the CAE event.
Troubleshooting MCAs
Cause
In general, MCAs are partition-based crashes and are detected and reported by CAE. To obtain a general overview of an MCA event, run show CAE -L, and then use the command show CAE -E -n <ID> to obtain more details for the CAE event.
To view problem action statements about the MCA event, use the show cae -L -c 10 command and note the Sl.No. Then display detailed information about the bad FRU, including the probable cause and recommended action, by using the show cae -E -n xxxx command, where xxxx is the Sl.No.
show cae -L -c 10
Sl.No Severity EventId EventCategory PartitionId EventTime Summary
##########################################################################################
72294 Fatal 9645 System Fir... 1 Wed Aug 13 07:10:57 2014 The nPartitions
72287 Degraded 100142 System Int... 1 Wed Aug 13 06:35:06 2014 PCIe Link
show cae -E -n 72287
Alert Number : 72287
Event Identification :
Event ID : 100142
Provider Name : PCIeIndicationProvider
Event Time : Wed Aug 13 06:35:06 2014
Indication Identifier : 310014220140813063506
Managed Entity :
OA Name :
System Type :
System Serial No. :
OA IP Address :
Affected Domain :
Enclosure Name :
RackName :
RackUID :
Impacted Domain :
Complex Name :
Partition ID :
SystemGUID :
Summary :
PCIE Link Bandwidth Reduction
Full Description :
The system has experienced an error on PCIe link. The data has been successfully retransmitted,
but the link is now operating at a lower bandwidth.
Probable Cause 1 :
The PCIe link hardware is not functioning properly.
Recommended Action 1 :
The PCIe link might be part of a single FRU, or might be technology that connects through multiple
FRU's. The FRU list is included as a reference. Check for physical damage (bent pins, cracked
traces, contamination or corrosion) on the FRU connection points and ensure proper mating/
seating occurs. If the problem persists, replace only one FRU at a time in the order given
below. Test the system between each FRU replacement.
Replaceable Units(s) :
...
...
...
MCA data is also stored on the OA and can be retrieved by running the OA command show errdump dir mca, as follows:
OA-CLI> show errdump dir mca
Logtype: MCA (Machine Check Abort)
Bundle nPar vPar time
0x011000000000aae6 1 Mon Jan 20 10:30:31 CET 2014
0x011000000000aae5 1 Fri Jan 17 12:23:49 CET 2014
0x011000000000aae4 1 Fri Jan 17 10:51:06 CET 2014
0x011000000000aae3 1 Thu Jan 16 21:43:45 CET 2014
0x011000000000aae2 1 Mon Jan 13 11:44:30 CET 2014
0x011000000000aae1 1 Mon Jan 13 11:43:27 CET 2014
0x011000000000aadf 1 Tue Dec 10 01:07:39 CET 2013
0x013000000000aac0 1 Sun Dec 8 01:12:08 CET 2013
0x011000000000aadd 1 Sat Dec 7 01:58:05 CET 2013
0x011000000000aadc 1 Sat Dec 7 01:57:02 CET 2013
If an MCA of interest is found, it can be captured by running the command show errdump mca bundle
<ID>.
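For example, to capture the most recent MCA bundle from the directory listing above:

OA-CLI> show errdump mca bundle 0x011000000000aae6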
Troubleshooting the blade interface (system console)
Cause
All system console connections are made through the OA CLI via the management network.
Linux uses the OA 10/100BT LAN connection over a private network to control one or more server blade operations, either locally through Telnet or SSH, or remotely over a public network through a web GUI.
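For example, assuming the active OA is reachable at 10.67.52.165 (a placeholder address; use the address of your active OA) and that you log in with the Administrator account, a console session begins with an SSH connection to the OA CLI from a Linux management station:

ssh Administrator@10.67.52.165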
Websites
General websites
Hewlett Packard Enterprise Information Library
www.hpe.com/info/EIL
Single Point of Connectivity Knowledge (SPOCK) Storage compatibility matrix
www.hpe.com/storage/spock
Storage white papers and analyst reports
www.hpe.com/storage/whitepapers
For additional websites, see Support and other resources.