System Memory Troubleshooting Best Practices for HP
ProLiant Servers
Accurately troubleshooting system memory issues in ProLiant server configurations is an important process that
can help prevent unnecessary replacement of hardware components. In addition, accurate problem diagnosis
prevents customers from experiencing unnecessary downtime while waiting for hardware that may not need
to be replaced. Following standard troubleshooting guidelines each time a memory issue is suspected helps
achieve both goals.
HP has developed several methods for troubleshooting memory problems in ProLiant servers.
The purpose of this white paper is to assist HP customers in troubleshooting system memory problems by
successfully isolating the specific DIMMs causing the problem. This helps to prevent nonessential replacement
of unaffected DIMMs or, in some cases, entire banks of memory. In addition, effective troubleshooting can
help determine if a firmware or other software download can resolve a problem without replacing hardware.
This white paper covers the following topics:
• Why should I troubleshoot every system memory problem?
• How can I tell if a memory problem has occurred?
• What tools are available from HP to help identify a failing DIMM?
• Troubleshooting using HP Insight Diagnostics Online Edition
• Troubleshooting using HP Insight Diagnostics Offline Edition
• Troubleshooting flowchart for bootable systems
• Troubleshooting flowchart for non-bootable systems
• What role does firmware play in solving system memory problems?
• Why Buy HP Memory?
• Other troubleshooting resources
Why Should I Troubleshoot Every Memory Problem?
Accurate diagnosis of system memory problems in ProLiant servers has many advantages, including:
• Prevents unnecessary hardware replacement.
• Prevents the return of parts that test NFF (No Fault Found).
• Prevents server downtime.
Best Practice: Many product issues that result in hardware replacement are preventable or
correctable with a firmware update. HP strongly recommends checking for a firmware update before
sending a part back to HP for replacement. Based on the HP ProLiant product return rates, a
significant percentage of all returned hardware products were functioning properly and only needed
a firmware update. Although not all products fall into this category, server downtime and time spent
removing, returning, and ultimately replacing hardware may have been avoided if an attempt had
been made to flash the firmware during the troubleshooting process.
How Can I Tell if a Memory Problem Has Occurred?
There are many indicators that a problem has occurred within the memory subsystem. HP has several tools
used to identify the status of hardware and software within a system. Using these tools is a good first step in
the troubleshooting process. When a system memory problem is suspected, check one or all of these
common places to find information:
• HP System Management Homepage (SMH)
• HP Systems Insight Manager (HP SIM)
• Server Logs
• DIMM Slot LEDs
IMPORTANT:
When a memory error is detected, the firmware illuminates the fault LEDs located near each DIMM
slot on the system board.
If the system identifies an error to a specific slot, that LED illuminates. However, if the system can only
identify an error within a bank, but cannot isolate a specific DIMM, all the LEDs in the bank will
illuminate.
In addition, if the system cannot identify the bank in which the error has occurred, all the LEDs in all
banks illuminate, making the task of isolating the failing DIMM more difficult and the chance of
replacing functioning banks of memory more likely.
Therefore, further troubleshooting is necessary to determine which specific DIMM is failing. Use the
LEDs as a tool in identifying that a memory problem may exist, but do not rely solely on the status of
the LEDs to determine if hardware should be replaced.
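The three LED cases described above (a single slot, a whole bank, or every bank) can be sketched as a small classifier. This is a minimal illustration only, not HP firmware logic. The bank layout is an assumption taken from the DL380 G4 example in Figure 5, reading the letter in each slot label as its bank, and the name classify_led_state is hypothetical.

```python
# Assumed bank layout, based on the DL380 G4 slot labels shown in Figure 5.
# Other servers differ; check the server user guide before relying on this.
BANKS = {
    "A": {"1A", "2A"},
    "B": {"3B", "4B"},
    "C": {"5C", "6C"},
}


def classify_led_state(lit_slots):
    """Report how far the lit DIMM fault LEDs narrow down a memory error."""
    lit = set(lit_slots)
    all_slots = set().union(*BANKS.values())
    unknown = lit - all_slots
    if unknown:
        raise ValueError(f"unknown slot labels: {sorted(unknown)}")

    # A bank counts as "fully lit" when every one of its slot LEDs is on.
    full_banks = sorted(b for b, slots in BANKS.items() if slots <= lit)
    in_full_banks = set().union(*(BANKS[b] for b in full_banks))
    single_slots = sorted(lit - in_full_banks)

    if not lit:
        level = "none"        # LEDs show no fault
    elif lit == all_slots:
        level = "all_banks"   # error not isolated to any bank
    elif full_banks:
        level = "bank"        # error isolated to a bank, not a DIMM
    else:
        level = "slot"        # error isolated to specific slot(s)
    return {"level": level, "banks": full_banks, "slots": single_slots}
```

Any result other than "slot" means the LEDs alone cannot name the failing DIMM, which is why further diagnostics are needed before replacing hardware.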
What Tools are Available from HP to Help Identify a Failing DIMM?
Refer to any of the following HP system tools whenever a memory problem is suspected.
HP System Management Homepage
The HP System Management Homepage supplies a consolidated view of system hardware health,
configuration, performance and status information for individual HP servers. Details are provided on total
system health, including system memory. Information on system memory can be found under the
“Performance” section on the main page (See Figure 1). For Linux and Windows, HP SMH is available in the
ProLiant Support Pack and Integrity Support Pack. To download the latest version of the ProLiant Support Pack
or Integrity Support Pack, navigate to the Support and Troubleshooting link on
http://www.hp.com.
Figure 1: Overview of the HP System Management Homepage
HP Systems Insight Manager (HP SIM)
HP Systems Insight Manager monitors the health of the hardware in the system and polls installed hardware
for its status every few minutes. Refer to Figure 2 below for an example of events displayed on the System
page. For more information on HP SIM, refer to the following URL:
Server Logs
Server system logs record the status of hardware events, including memory issues. For servers running
Microsoft Windows operating systems, either of the following logs can be a valuable resource:
• Integrated Management Log (IML)
• Event Viewer
For servers running Linux operating systems, refer to either of the following:
• Integrated Management Log (IML)
• /var/log/messages file
Microsoft Windows Operating Systems: Using the IML
The IML Viewer is a software tool created by HP and is available as a downloadable component pack from
HP.COM. It can also be accessed via the HP System Management Homepage (SMH). Navigate to this tool
through SMH by clicking on the Logs tab or through the operating system from HP System Tools. Figure 3
below shows the Integrated Management Log accessed via SMH. System memory issues, if present, will be
recorded and will be visible in the IML.
Figure 3: Integrated Management Log
Microsoft Windows Operating Systems: Using Event Viewer
The Event Viewer is a software tool available as part of Microsoft Windows operating systems. It can be
accessed by navigating to HP System Tools via the Start menu. Figure 4 shows an example of server events
that are logged in this tool.
Figure 4: Event Viewer (Microsoft Windows Operating Systems)
Linux Operating Systems: Using the IML
The IML Viewer is a software tool created by HP and is available as a downloadable component pack from
HP.COM. For systems running Linux, type hplog -v at the command prompt to display the entire IML and check
for system memory errors. Any
detected system memory error will be logged, including any pre-failure warranty memory events on systems
where applicable.
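When the log is long, it can help to filter it for memory-related entries. The following is a minimal sketch, assuming the log text has been saved to a file or captured as a list of lines. The keyword list is an assumption, the function name find_memory_events is hypothetical, and the sample lines are invented for illustration; they do not reproduce the actual IML output format.

```python
import re

# Assumed keywords; adjust them to the messages your servers actually log.
MEMORY_PATTERN = re.compile(r"\b(memory|dimms?|ecc|parity)\b", re.IGNORECASE)


def find_memory_events(lines):
    """Return (line_number, text) pairs for lines that mention memory terms."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MEMORY_PATTERN.search(line)
    ]


# Hypothetical sample lines, not real hplog output.
sample = [
    "example entry: corrected memory error threshold exceeded\n",
    "example entry: network link up\n",
    "example entry: DIMM in slot 4B reported a fault\n",
]

for number, text in find_memory_events(sample):
    print(f"{number}: {text}")
```

To run it against real data, save the output of hplog -v to a file (or open the Linux system log) and pass the open file object to find_memory_events in place of the sample list.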
Linux Operating Systems: Using /var/log/messages File
The Linux system log (/var/log/messages) can be viewed using the cat, more and less commands. The
following types of messages may be logged here if a memory problem has occurred:
DIMM Slot LEDs
DIMM slot LEDs can be helpful indicators of issues with system memory. Figure 5 below gives an example of
the location of LEDs that will illuminate when certain DIMM slots indicate a failure. The example below is from
a ProLiant DL380 G4 system board; however, similar graphics for specific systems can be found in server
user guides and other system documentation.
Figure 5: DIMM Slot LEDs on the System Board (example: ProLiant DL380 G4)
Memory problem indicators (Amber = Memory failed, Off = Normal):
• 5: DIMM failure, slot 6C
• 6: DIMM failure, slot 5C
• 7: DIMM failure, slot 4B
• 8: DIMM failure, slot 3B
• 9: DIMM failure, slot 2A
• 10: DIMM failure, slot 1A
Once a system memory problem is suspected based on one of the methods outlined above, the first step is to
schedule downtime for troubleshooting. HP recommends using HP Insight Diagnostics, available on the HP
SmartStart CD, to begin the troubleshooting process.
HP Insight Diagnostics is a proactive server management tool that is available in offline and online editions. It
provides diagnostics and troubleshooting capabilities to help locate system and component problems.
Features of HP Insight Diagnostics:
• Works on multiple operating systems
• Can be used offline or online
• HP Insight Management Agents and the System Management driver can be leveraged
• Common Diagnostic Model (CDM) compliance.
For more information on HP Insight Diagnostics, refer to the following URL:
Troubleshooting Using HP Insight Diagnostics Online Edition
HP Insight Diagnostics Online edition is available in HP ProLiant Support Packs.
The Insight Diagnostics Online edition performs various non-intrusive in-depth system and component
diagnoses while the operating system is running. The “survey feature” can identify and resolve problems
without taking the system down.
Troubleshooting Using HP Insight Diagnostics Offline Edition
Insight Diagnostics Offline edition is available in HP SmartStart by choosing “Launch server diagnostics.”
Some benefits of HP Insight Diagnostics are:
• Performs extensive in-depth system and component testing while in a controlled operating
environment.
• Survey feature enables IT administrators to track hardware and software changes in order to form a
complete and thorough auditing process for the system.
• Test results can be analyzed by IT administrators to diagnose system and component problems in
order to repair and return servers back to the production environment.
Online Vs. Offline Testing
Offline is the preferred testing method because it is the most accurate. When testing offline, there is minimal
interference from the operating system and address space testing is maximized because a very small Linux
kernel is used.
Helpful Links:
HP Insight Diagnostics User Guide:
www.hp.com/servers/diags
Troubleshooting Flowcharts
The following troubleshooting flowcharts can be used as another tool for diagnosing specific DIMMs that
have failed. They can be used as a general guideline and should be used in conjunction with the other tools
and methods outlined in this white paper. Because many system issues can be solved by upgrading
firmware, download the latest version of the firmware before proceeding with the troubleshooting methods
outlined in the following flowcharts. The latest firmware is available at the following URL:
General Troubleshooting Flowchart for a Bootable System Using HP Insight Diagnostics
General Troubleshooting Flowchart for a Non-Bootable System
Use this flowchart to help diagnose a memory problem in any of the following conditions:
• The system stops responding and displays a “parity error” message on the screen during boot.
• The system beeps and all of the DIMM LEDs illuminate during boot.
• The system is unable to boot far enough to run Offline Diagnostics, but no messages are displayed
that can identify the failing DIMM.
What Role Does Firmware Play in Troubleshooting Memory Problems?
Many product issues that result in hardware replacement, including issues in which memory problems are
suspected, are preventable or correctable with a firmware update. HP recommends checking for a firmware
update before sending a part back to HP for replacement. Based on the HP ProLiant product return rates, a
significant percentage of all returned hardware products were functioning properly and only needed a
firmware update. Although not all products fall into this category, server downtime and time spent removing,
returning, and ultimately replacing hardware may have been avoided if an attempt had been made to flash
the firmware during the troubleshooting process.
The following paragraphs describe each of HP's methods for updating firmware and explain how to perform
the updates.
Currently, there are two different methods for updating firmware on HP servers and options: the Online ROM
Flash and the Offline ROM Flash. Note that the Online ROM Flash is not currently available for all products. If
an Online ROM Flash is unavailable for a particular server or option, an Offline upgrade will need to be
performed.
TIP:
If a server is deployed more than three months after purchase, use the HP Support and Drivers
page on HP.COM, rather than the Firmware Maintenance CD that shipped with the server. In
addition, check for firmware updates for any options that may have been in stock but were not
deployed until later. This ensures that all server components are running the latest firmware versions.
The latest firmware upgrades are available at the following URL:
Updating Firmware Using the ONLINE ROM FLASH Method
The Online ROM Flash is an innovative technology developed by HP that allows the firmware to be upgraded
either locally or remotely via a downloadable file called a Smart Component. These Smart Components
enable the update to be performed while the server is operational, thereby avoiding costly server downtime.
Benefits of the Online ROM Flash include:
• The server does not have to be taken offline to perform the upgrade.
• The upgrade process takes less than a minute to complete.
• The server can be scheduled for a reboot at a later time to deploy the new firmware after the
upgrade process.
• The server administrator can remotely perform the upgrade to multiple servers at one time using the
ProLiant Remote Deployment Utility, the ProLiant Remote Deployment Console Utility, and other HP
server management technologies, such as HP Systems Insight Manager (HP SIM).
The Smart Component updates the firmware and configures the system so that the new settings will take effect
on the next reboot. This feature allows the update to be performed but gives the administrator control of when
the new settings are deployed.
For detailed information on the Online ROM Flash process, refer to the Online ROM Flash User Guide at the
following URL:
c00683436, Online ROM Flash Tool Available For Updating Firmware On ProLiant
Updating Firmware Using the OFFLINE ROM Flash Method
The Offline ROM Flash, as its name implies, is performed when the server is taken down for regular
maintenance. Although the results will be the same, the Offline ROM Flash does not provide the same benefits
as the new Online ROM Flash method. In addition, when upgrading remotely, the server administrator can
only update one server at a time.
There are two methods of performing an Offline ROM Flash. The firmware can be updated using a ROMPaq
Diskette or using the ROM Update Utility.
A ROMPaq is a floppy-disk based method of upgrade. The firmware is downloaded onto a floppy diskette
and then the system is booted to the floppy drive.
The ROM Update Utility is located on the Firmware Maintenance CD, or can be downloaded to a USB Drive
Key using the HP Drive Key Boot Utility.
Note: Hard Drive components can only be updated using the Offline method.
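The selection rules stated in this section (Online where available, Offline otherwise, Offline only for hard drives, and the three-month guidance from the TIP) can be summarized in a short sketch. This is illustrative only: the function names and the string labels such as "hard_drive" are hypothetical, and treating the Firmware Maintenance CD as acceptable within three months is an assumption inferred from the TIP rather than a stated HP rule.

```python
import calendar
from datetime import date


def choose_flash_method(component_type, online_flash_available):
    """Pick a ROM flash method from the rules stated in this section."""
    # Hard drive components can only be updated using the Offline method.
    if component_type == "hard_drive":
        return "offline"
    # Fall back to Offline when no Online ROM Flash exists for the product.
    return "online" if online_flash_available else "offline"


def add_months(d, months):
    """Return the date `months` calendar months after d, clamping the day."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def choose_firmware_source(purchased, deployed):
    """Apply the TIP: more than three months after purchase, use the website."""
    if deployed > add_months(purchased, 3):
        return "support_and_drivers_page"
    # Assumption: the CD that shipped with the server is recent enough here.
    return "firmware_maintenance_cd"


print(choose_flash_method("system_rom", online_flash_available=False))
print(choose_firmware_source(date(2006, 1, 10), date(2006, 6, 1)))
```

Keeping the two decisions in separate functions mirrors the text: one decides how to flash, the other decides where to obtain the firmware.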
The HP Drive Key Boot Utility
The HP Drive Key Boot Utility can format an HP Drive Key so that it can be used as a bootable device. The
utility also provides the ability to load the ROM Update Utility on an HP Drive Key. After the ROM Update
Utility has been installed, the Offline ROM Flash Smart Components can be downloaded to the drive key from
the following URL and deployed using the ROM Update Utility:
Why Buy HP Memory?
The white paper Why Buy HP Memory provides important information for customers who are buying memory for
HP servers. It addresses important questions from HP customers and explains HP's extensive qualification
and testing procedures and memory warranty issues. It may also answer common customer questions about the
impact of memory on system performance, reliability, and stability, and it describes the testing,
certification, and warranty that differentiate HP-branded memory from other memory modules on the market.
Navigate to the following link to access the white paper:
HP does not support memory that lacks the HP security label. Any DIMM that does not have a valid HP
security label attached is considered third-party memory and is not supported, regardless of vendor.
The standard memory warranty statement is posted at the following link:
Other Troubleshooting Resources
The following Customer Advisories contain information about known memory issues and may be helpful in
providing solutions to specific memory-related problems. Refer to HP.COM for additional Customer Advisories
or sign up for Subscriber’s Choice and have advisories and notices for your specific products delivered
proactively:
http://www.hp.com/go/myadvisory
ProLiant Servers With Hot Add Memory Capability May Be Unable To Access Memory Above 4 GB
Because PAE Support Is Not Enabled In Microsoft Windows Server 2003:
First Edition of the HP ProLiant DL380 Generation 5 User Guide Incorrectly States that Advanced ECC
Memory Single Fully Buffered DIMM (FBDIMM) Mode is Supported:
Delays up to Several Minutes Before the HP ProLiant Logo Appears on the Monitor May Be Observed
After a ProLiant Server is Powered On or Rebooted if the Server is Configured with Many DIMMs:
Before replacing memory in ProLiant servers, ensure that accurate troubleshooting methods are used. Taking
the time to isolate specific DIMMs that are failing can prevent unnecessary hardware replacement and
customer dissatisfaction. Follow standard troubleshooting guidelines highlighted in this white paper and use
them each time a memory issue is suspected. Accurate problem diagnosis prevents customers from
experiencing unnecessary downtime.