Hp PROLIANT BL20P G3, PROLIANT ML110, PROLIANT ML570 G3, PROLIANT ML150 G3, PROLIANT DL560 System Memory Troubleshooting Best Practices

System Memory Troubleshooting Best Practices for HP ProLiant Servers
Accurately troubleshooting system memory issues in ProLiant server configurations is an important process that can help prevent unnecessary replacement of hardware components. In addition, accurate problem diagnosis prevents customers from experiencing unnecessary downtime while waiting for hardware that may not need to be replaced. Following standard troubleshooting guidelines each time a memory issue is suspected helps to ensure an accurate diagnosis.
HP has developed several methods for troubleshooting memory problems in ProLiant servers.
The purpose of this white paper is to assist HP customers in troubleshooting system memory problems by successfully isolating the specific DIMMs causing the problem. This helps to prevent nonessential replacement of unaffected DIMMs or, in some cases, entire banks of memory. In addition, effective troubleshooting can help determine if a firmware or other software download can resolve a problem without replacing hardware.
This white paper covers the following topics:
Why should I troubleshoot every system memory problem?
How can I tell if a memory problem has occurred?
What tools are available from HP to help identify a failing DIMM?
Troubleshooting using HP Insight Diagnostics Online Edition
Troubleshooting using HP Insight Diagnostics Offline Edition
Troubleshooting flowchart for bootable systems
Troubleshooting flowchart for non-bootable systems
What role does firmware play in solving system memory problems?
Why Buy HP Memory?
Other troubleshooting resources

Why Should I Troubleshoot Every Memory Problem?

Accurate diagnosis of system memory problems in ProLiant servers has many advantages, including:
Prevents unnecessary hardware replacement.
Prevents the return of parts that test NFF (No Fault Found).
Prevents server downtime.
Best Practice: Many product issues that result in hardware replacement are preventable or correctable with a firmware update. HP strongly recommends checking for a firmware update before sending a part back to HP for replacement. Based on the HP ProLiant product return rates, a significant percentage of all returned hardware products were functioning properly and only needed a firmware update. Although not all products fall into this category, server downtime and time spent removing, returning, and ultimately replacing hardware may have been avoided if an attempt had been made to flash the firmware during the troubleshooting process.

How Can I Tell if a Memory Problem Has Occurred?

There are many indicators that a problem has occurred within the memory subsystem. HP has several tools used to identify the status of hardware and software within a system. Using these tools is a good first step in the troubleshooting process. When a system memory problem is suspected, check one or all of these common places to find information:
HP System Management Homepage (SMH)
HP Systems Insight Manager (HP SIM)
Server Logs
DIMM Slot LEDs
IMPORTANT:
When a memory error is detected, the firmware illuminates the fault LEDs located near each DIMM slot on the system board.
If the system identifies an error to a specific slot, that LED illuminates. However, if the system can only identify an error within a bank, but cannot isolate a specific DIMM, all the LEDs in the bank will illuminate.
In addition, if the system cannot identify the bank in which the error has occurred, all the LEDs in all banks illuminate, making the task of isolating the failing DIMM more difficult and the chance of replacing functioning banks of memory more likely.
Therefore, further troubleshooting is necessary to determine which specific DIMM is failing. Use the LEDs as a tool in identifying that a memory problem may exist, but do not rely solely on the status of the LEDs to determine if hardware should be replaced.
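The LED behavior described above amounts to a three-level isolation rule: a single slot, a whole bank, or no isolation at all. The following minimal Python sketch illustrates that reading; it is not HP firmware logic. The `BANKS` layout and the `interpret_leds` function are hypothetical names introduced for illustration, and the real slot-to-bank mapping must be taken from the server user guide.

```python
# Hypothetical bank layout for illustration only; consult the server's
# user guide for the real slot-to-bank mapping.
BANKS = {
    "A": {"1A", "2A"},
    "B": {"3B", "4B"},
    "C": {"5C", "6C"},
}


def interpret_leds(lit_slots, banks=BANKS):
    """Classify how precisely the lit fault LEDs isolate a memory error."""
    lit = set(lit_slots)
    all_slots = set().union(*banks.values())
    if not lit:
        return "no fault indicated"
    if lit == all_slots:
        # Every LED in every bank is lit: the bank could not be identified.
        return "bank not identified: run diagnostics before replacing anything"
    for name, slots in banks.items():
        if lit == slots and len(slots) > 1:
            # A whole bank is lit: the error was isolated to the bank only.
            return f"isolated to bank {name}: further troubleshooting needed"
    if len(lit) == 1:
        return f"isolated to slot {next(iter(lit))}"
    return "ambiguous LED pattern: further troubleshooting needed"


print(interpret_leds(["3B", "4B"]))
```

Only the single-slot result points at a specific DIMM; every other result calls for further diagnostics before any hardware is replaced.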
What Tools are Available from HP to Help Identify a Failing DIMM?
Refer to any of the following HP system tools whenever a memory problem is suspected.
HP System Management Homepage
The HP System Management Homepage supplies a consolidated view of system hardware health, configuration, performance and status information for individual HP servers. Details are provided on total system health, including system memory. Information on system memory can be found under the “Performance” section on the main page (See Figure 1). For Linux and Windows, HP SMH is available in the ProLiant Support Pack and Integrity Support Pack. To download the latest version of the ProLiant Support Pack or Integrity Support Pack, navigate to the Support and Troubleshooting link on
http://www.hp.com.
Figure 1: Overview of the HP System Management Homepage

HP Systems Insight Manager (HP SIM)

HP Systems Insight Manager monitors the health of the hardware in the system and polls installed hardware for its status every few minutes. Refer to Figure 2 below for an example of events displayed on the System page. For more information on HP SIM, refer to the following URL:
http://h18013.www1.hp.com/products/servers/management/hpsim/index.html
Figure 2: HP Systems Insight Manager

System Logs

Server system logs record the status of hardware events, including memory issues. For servers running Microsoft Windows operating systems, either of the following logs can be a valuable resource:
Integrated Management Log (IML)
Event Viewer
For servers running Linux operating systems, refer to either of the following:
Integrated Management Log (IML)
/var/log/messages file
Microsoft Windows Operating Systems: Using the IML

The IML Viewer is a software tool created by HP and is available as a downloadable component pack from HP.COM. It can also be accessed via the HP System Management Homepage (SMH). Navigate to this tool through SMH by clicking on the Logs tab or through the operating system from HP System Tools. Figure 3 below shows the Integrated Management Log accessed via SMH. System memory issues, if present, will be recorded and will be visible in the IML.
Figure 3: Integrated Management Log

Microsoft Windows Operating Systems: Using Event Viewer

The Event Viewer is a software tool available as part of Microsoft Windows operating systems. It can be accessed by navigating to HP System Tools via the Start menu. Figure 4 shows an example of server events that are logged in this tool.
Figure 4: Event Viewer (Microsoft Windows Operating Systems)
Linux Operating Systems: Using the IML

The IML Viewer is a software tool created by HP and is available as a downloadable component pack from HP.COM. On systems running Linux, type hplog -v at the command prompt to display the entire IML and check for system memory errors. Any detected system memory error is logged, including any pre-failure warranty memory events on systems where applicable.
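When the IML is long, it can help to filter it for memory entries. The following Python sketch shows one way to do that, assuming that each entry carries a "LOG:" message and that memory events contain the word "memory". The `memory_events` function and the sample text are illustrative; the exact layout of hplog output may differ between hpasm versions.

```python
import re

# Matches the "LOG: <message>" part of an IML entry; the layout is an
# assumption and may differ between hpasm versions.
LOG_PATTERN = re.compile(r"LOG:\s*(?P<message>.+)$")


def memory_events(iml_text):
    """Return IML log messages that mention memory."""
    events = []
    for line in iml_text.splitlines():
        match = LOG_PATTERN.search(line)
        if match and "memory" in match.group("message").lower():
            events.append(match.group("message").strip())
    return events


# Sample text for illustration; on a live system the text could be captured
# with subprocess.run(["hplog", "-v"], capture_output=True, text=True).stdout
SAMPLE = """\
0011 Information 22:50 09/16/2004 22:50 09/16/2004 0001 LOG: IML Cleared (Administrator)
0033 Caution 04:02 10/06/2006 04:02 10/06/2006 0001 LOG: Corrected Memory Error threshold exceeded (Slot 3, Memory Module 6)
"""

for message in memory_events(SAMPLE):
    print(message)
```

A keyword filter of this kind is deliberately loose: it is meant to shorten the list for a human reader, not to decide whether a DIMM should be replaced.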

Linux Operating Systems: Using /var/log/messages File

The Linux system log (/var/log/messages) can be viewed using the cat, more and less commands. The following types of messages may be logged here if a memory problem has occurred:
Oct 11 01:51:18 dhcp57-150 hpasmd[12039]: WARNING: hpasmd: Corrected Memory Error threshold exceeded (Slot 3, Memory Module 6)
[root@dhcp57-150 ~]# hplog -v
ID Severity Initial Time Update Time Count
------------------------------------------------------------
0011 Information 22:50 09/16/2004 22:50 09/16/2004 0001 LOG: IML Cleared (Administrator)
0012 Repaired 00:32 09/21/2004 00:32 09/21/2004 0001 LOG: Memory Cartridge Not Redundant (Slot 5)
0033 Caution 04:02 10/06/2006 04:02 10/06/2006 0001 LOG: Corrected Memory Error threshold exceeded (Slot 3, Memory Module 6)
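The hpasmd warning in /var/log/messages names the slot and memory module, so it can be extracted programmatically. The following Python sketch is built around the single message format quoted in this paper; the pattern and the `find_memory_warnings` function are assumptions made for illustration, and other hpasm versions may word the message differently.

```python
import re

# Pattern assumed from the single hpasmd message quoted in this paper;
# other hpasm versions may word the message differently.
HPASMD_PATTERN = re.compile(
    r"hpasmd\[\d+\]: WARNING: hpasmd: Corrected Memory Error threshold "
    r"exceeded \(Slot (?P<slot>\d+), Memory Module (?P<module>\d+)\)"
)


def find_memory_warnings(lines):
    """Yield (slot, module) for each hpasmd corrected-memory-error warning."""
    for line in lines:
        match = HPASMD_PATTERN.search(line)
        if match:
            yield int(match.group("slot")), int(match.group("module"))


SAMPLE_LINE = (
    "Oct 11 01:51:18 dhcp57-150 hpasmd[12039]: WARNING: hpasmd: "
    "Corrected Memory Error threshold exceeded (Slot 3, Memory Module 6)"
)
print(list(find_memory_warnings([SAMPLE_LINE])))  # prints [(3, 6)]
```

On a live system the function could be fed an open file object for /var/log/messages, which usually requires root privileges to read.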

DIMM Slot LEDs as Memory Problem Indicators

DIMM slot LEDs can be helpful indicators of issues with system memory. Figure 5 below gives an example of the location of LEDs that will illuminate when certain DIMM slots indicate a failure. The example below is from a ProLiant DL380 G4 system board; however, similar graphics for specific systems can be found in server user guides and other system documentation.
Figure 5: DIMM Slot LEDs on the System Board (ProLiant DL380 G4 example)
Item 5: DIMM failure, slot 6C
Item 6: DIMM failure, slot 5C
Item 7: DIMM failure, slot 4B
Item 8: DIMM failure, slot 3B
Item 9: DIMM failure, slot 2A
Item 10: DIMM failure, slot 1A
For each LED: Amber = Memory failed; Off = Normal
Once a system memory problem is suspected based on one of the methods outlined above, the first step is to schedule downtime for troubleshooting. HP recommends using HP Insight Diagnostics, available on the HP SmartStart CD, to begin the troubleshooting process.
HP Insight Diagnostics is a proactive server management tool that is available in offline and online editions. It provides diagnostics and troubleshooting capabilities to help locate system and component problems.
Features of HP Insight Diagnostics:
Works on multiple operating systems
Can be used offline or online
HP Insight Management Agents and the System Management driver can be leveraged
Common Diagnostic Model (CDM) compliance.
For more information on HP Insight Diagnostics, refer to the following URL:
http://h18013.www1.hp.com/products/servers/management/hpid/index.html

Troubleshooting Using HP Insight Diagnostics Online Edition

HP Insight Diagnostics Online edition is available in HP ProLiant Support Packs.
The Insight Diagnostics Online edition performs various non-intrusive in-depth system and component diagnoses while the operating system is running. The “survey feature” can identify and resolve problems without taking the system down.

Troubleshooting Using HP Insight Diagnostics Offline Edition

Insight Diagnostics Offline edition is available in HP SmartStart by choosing “Launch server diagnostics.” Some benefits of HP Insight Diagnostics are:
Performs extensive in-depth system and component testing while in a controlled operating environment.
Survey feature enables IT administrators to track hardware and software changes in order to form a complete and thorough auditing process for the system.
Test results can be analyzed by IT administrators to diagnose system and component problems in order to repair and return servers back to the production environment.
Online Vs. Offline Testing
Offline is the preferred testing method because it is the most accurate. When testing offline, there is minimal interference from the operating system and address space testing is maximized because a very small Linux kernel is used.
Helpful Links:
HP Insight Diagnostics User Guide:
www.hp.com/servers/diags

Troubleshooting Flowcharts

The following troubleshooting flowcharts can be used as another tool for diagnosing specific DIMMs that have failed. They can be used as a general guideline and should be used in conjunction with the other tools and methods outlined in this white paper. Because a lot of system issues can be solved by upgrading firmware, download the latest version of the firmware before proceeding with the troubleshooting methods outlined in the following flowcharts. The latest firmware is available at the following URL:
http://h18023.www1.hp.com/support/files/server/us/romtabl.html
General Troubleshooting Flowchart for a Bootable System Using HP Insight Diagnostics

General Troubleshooting Flowchart for a Non-Bootable System

Use this flowchart to help diagnose a memory problem in any of the following conditions:
The system stops responding and displays a “parity error” message on the screen during boot.
The system beeps and all of the DIMM LEDs illuminate during boot.
The system is unable to boot far enough to run Offline Diagnostics, but no messages are displayed that can identify the failing DIMM.
What Role Does Firmware Play in Troubleshooting Memory Problems?
Many product issues that result in hardware replacement, including issues in which memory problems are suspected, are preventable or correctable with a firmware update. HP recommends checking for a firmware update before sending a part back to HP for replacement. Based on the HP ProLiant product return rates, a significant percentage of all returned hardware products were functioning properly and only needed a firmware update. Although not all products fall into this category, server downtime and time spent removing, returning, and ultimately replacing hardware may have been avoided if an attempt had been made to flash the firmware during the troubleshooting process.
The following paragraphs describe each of HP's methods for updating firmware and explain how to perform the updates.
Currently, there are two different methods for updating firmware on HP servers and options: the Online ROM Flash and the Offline ROM Flash. Note that the Online ROM Flash is not currently available for all products. If an Online ROM Flash is unavailable for a particular server or option, an Offline upgrade will need to be performed.
TIP:
If a server is deployed more than three months after purchase, use the HP Support and Drivers page on HP.COM, rather than the Firmware Maintenance CD that shipped with the server. In addition, check for firmware updates for any options that may have been in stock but were not deployed until later. This ensures that all server components are running the latest firmware versions.
The latest firmware upgrades are available at the following URL:
http://h18023.www1.hp.com/support/files/server/us/locate/8641.html
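The two rules above (prefer the Online ROM Flash when it is available, and prefer downloaded firmware when a server is deployed more than three months after purchase) can be expressed as a small decision helper. This Python sketch is one reading of those rules and not an HP tool: the `firmware_plan` function is hypothetical, "three months" is approximated as 90 days, and treating the Firmware Maintenance CD as acceptable within that window is an inference from the tip.

```python
from datetime import date, timedelta

# "Three months" is approximated as 90 days; this threshold is an assumption.
STALE_AFTER = timedelta(days=90)


def firmware_plan(purchased, deployed, online_flash_available):
    """Suggest a firmware source and flash method for a server or option."""
    if deployed - purchased > STALE_AFTER:
        # The shipped CD is likely out of date; download current firmware.
        source = "HP Support and Drivers page"
    else:
        source = "Firmware Maintenance CD"
    method = "Online ROM Flash" if online_flash_available else "Offline ROM Flash"
    return source, method


print(firmware_plan(date(2006, 1, 1), date(2006, 6, 1), True))
```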

Updating Firmware Using the ONLINE ROM FLASH Method

The Online ROM Flash is an innovative technology developed by HP that allows the firmware to be upgraded either locally or remotely via a downloadable file called a Smart Component. These Smart Components enable the update to be performed while the server is operational, thereby avoiding costly server downtime.
Benefits of the Online ROM Flash include:
The server does not have to be taken offline to perform the upgrade.
The upgrade process takes less than a minute to complete.
The server can be scheduled for a reboot at a later time to deploy the new firmware after the upgrade process.
The server administrator can remotely perform the upgrade to multiple servers at one time using the ProLiant Remote Deployment Utility, the ProLiant Remote Deployment Console Utility, and other HP server management technologies, such as HP Systems Insight Manager (HP SIM).
The Smart Component updates the firmware and configures the system so that the new settings will take effect on the next reboot. This feature allows the update to be performed but gives the administrator control of when the new settings are deployed.
For detailed information on the Online ROM Flash process, refer to the Online ROM Flash User Guide at the following URL:
http://www.compaq.com/support/files/server/us/webdoc/rom/OnlineROMFlashUserGuide.pdf
Smart Components for HP ProLiant servers and storage can be obtained from the following links:
Microsoft Windows Operating Systems: http://h18023.www1.hp.com/support/files/server/us/winroms.html
Linux Operating Systems: http://h18023.www1.hp.com/support/files/server/us/linuxrom.html
For more information on how to deploy firmware updates remotely, refer to the following URL:
http://h18004.www1.hp.com/products/servers/management/im/index.html
Customer Advisory c00683436, Online ROM Flash Tool Available For Updating Firmware On ProLiant Servers And Options, is available at:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c00683436

Updating Firmware Using the OFFLINE ROM Flash Method

The Offline ROM Flash, as its name implies, is performed when the server is taken down for regular maintenance. Although the results are the same, the Offline ROM Flash does not provide the same benefits as the newer Online ROM Flash method. In addition, when upgrading remotely, the server administrator can only update one server at a time.
There are two methods of performing an Offline ROM Flash. The firmware can be updated using a ROMPaq Diskette or using the ROM Update Utility.
A ROMPaq is a floppy-disk based method of upgrade. The firmware is downloaded onto a floppy diskette and then the system is booted to the floppy drive.
The ROM Update Utility is located on the Firmware Maintenance CD, or can be downloaded to a USB Drive Key using the HP Drive Key Boot Utility.
Note: Hard Drive components can only be updated using the Offline method.

The HP Drive Key Boot Utility

The HP Drive Key Boot Utility can format an HP Drive Key so that it can be used as a bootable device. The utility also provides the ability to load the ROM Update Utility on an HP Drive Key. After the ROM Update Utility has been installed, the Offline ROM Flash Smart Components can be downloaded to the drive key from the following URL and deployed using the ROM Update Utility:
http://h18023.www1.hp.com/support/files/server/us/smartstartGP.html
System ROM support is required for the HP Drive Key Boot Utility. To determine server support, refer to the following URL:
http://h18004.www1.hp.com/products/servers/platforms/usb-support.html
For additional information on the HP Drive Key Boot Utility, refer to the following URL:
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00218060/c00218060.pdf
The HP Drive Key Boot Utility can be downloaded from the following URL:
http://www.compaq.com/support/files/server/us/locate/8641.html

Why Buy HP Memory?

The white paper Why Buy HP Memory provides important information for customers who are buying memory for HP servers. It addresses important questions from HP customers, explains HP's extensive qualification and testing procedures, and covers memory warranty issues. It may also answer common customer questions about the impact of memory on system performance, reliability, and stability, and it describes the testing, certification, and warranty that differentiate HP-branded memory from other memory modules on the market. The white paper is available at the following link:
http://h71028.www7.hp.com/ERC/downloads/4AA0-4216ENW.pdf
HP does not support any memory that does not have the HP security label attached. Any DIMM without a valid HP security label is considered third-party memory and is not supported, regardless of vendor.
The standard memory warranty statement is posted at the following link:
http://h18004.www1.hp.com/products/servers/platforms/warranty/index.html
For additional information on HP ProLiant Server memory, navigate to the following URL:
http://h18004.www1.hp.com/products/servers/options/memory-description.html?jumpid=reg_R1002_USEN
For information on Advanced Memory Protection, navigate to the following URL:
http://h18004.www1.hp.com/products/servers/technology/memoryprotection.html
ProLiant memory architecture white papers are available at the following URL:
http://h18004.www1.hp.com/products/servers/technology/whitepapers/adv-technology.html#2

Other Troubleshooting Resources

The HP ProLiant Troubleshooting Guide is located at the following URL and may be helpful in diagnosing memory or other system problems:
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00300504/c00300504.pdf
The following Customer Advisories contain information about known memory issues and may be helpful in providing solutions to specific memory-related problems. Refer to HP.COM for additional Customer Advisories or sign up for Subscriber’s Choice and have advisories and notices for your specific products delivered proactively:
http://www.hp.com/go/myadvisory
ProLiant Servers With Hot Add Memory Capability May Be Unable To Access Memory Above 4 GB Because PAE Support Is Not Enabled In Microsoft Windows Server 2003:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=PSD_EM040802_CW01&jumpid=reg_R1002_USEN
First Edition of the HP ProLiant DL380 Generation 5 User Guide Incorrectly States that Advanced ECC Memory Single Fully Buffered DIMM (FBDIMM) Mode is Supported:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c00712235
SYSTEM ROM UPGRADE RECOMMENDED to Ensure Proper Support for all Fully Buffered DIMMs Installed in Certain ProLiant Servers:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c00805126
Delays up to Several Minutes Before the HP ProLiant Logo Appears on the Monitor May Be Observed After a ProLiant Server is Powered On or Rebooted if the Server is Configured with Many DIMMs:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c00776039
Additional Troubleshooting is Required to Locate a Specific DIMM on ProLiant Servers when Multiple DIMM Slot Fault LEDs are Illuminated:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c00713352
HP ProLiant DL145 G2 Server Configured with Dual Processors May Not Report Memory Installed in DIMM Slots 5 Through 8:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?locale=en_US&objectID=c00633786

Summary

Before replacing memory in ProLiant servers, ensure that accurate troubleshooting methods are used. Taking the time to isolate specific DIMMs that are failing can prevent unnecessary hardware replacement and customer dissatisfaction. Follow standard troubleshooting guidelines highlighted in this white paper and use them each time a memory issue is suspected. Accurate problem diagnosis prevents customers from experiencing unnecessary downtime.
© 2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
446969-002, 1/2007