HP PROLIANT DL360 G4, PROLIANT DL380 G4, PROLIANT ML370 G4 User Manual

Advanced memory protection for HP ProLiant 300 series G4 servers
Introduction
Protection from memory failures
Benefits of online spare memory
Deployment considerations
Implementation differences between ProLiant 300 series G3 and G4 servers
Dual rank vs. single rank DIMMs
Configuration rules for online spare
Configuring the AMP mode
Managing failures in a system with online spare enabled
Symmetric memory mode
Memory Scrubbing
  What is memory scrubbing?
  Detailed description of memory scrubbing

Introduction

Advanced Memory Protection (AMP) is a set of memory features that provides increased tolerance of, and protection from, memory failures. The level of AMP supported on a ProLiant server depends on the class of server. Refer to the product QuickSpecs for specific information on the features supported on each ProLiant server.
AMP features include Advanced ECC, Online Spare Memory, Memory Mirroring, and RAID. Advanced ECC and Online Spare are supported on 300 series platforms. This whitepaper details Advanced ECC and Online Spare support on the 300 series platforms: how these features are enabled, the configuration rules for using them, which utilities can be used to monitor failures, and how failures can be repaired.

Memory failures defined

Memory failures differ in severity and in their effect on the state of the server. Memory errors are classified as either correctable or uncorrectable.
Correctable errors can be detected and corrected if the chipset and DIMM support this functionality. Error detection and correction is implemented by storing data and ECC bits on the DIMM. By utilizing the data and ECC bits, the system can detect memory errors and correct certain types of failures. Correctable errors are generally single-bit errors. All ProLiant 300-series servers are capable of detecting and correcting single-bit errors. In addition, ProLiant servers with Advanced ECC support can detect and correct some multi-bit errors. HP’s Advanced ECC allows detection and correction of multi-bit failures if all failed bits are contained within a single DRAM device on the DIMM.
Correctable errors can be classified as “hard” and “soft” errors. With a hard error, every access to the memory location will return an error. A hard error typically indicates a problem with the DIMM. With a soft error, the data and/or ECC bits on the DIMM are incorrect, but the error will not continue to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors are typically caused by cosmic rays. They are rare but expected occurrences.
Although hard correctable memory errors are corrected by the system and do not result in system downtime or data corruption, they indicate a problem with the hardware. Soft errors, by contrast, do not indicate any issue with the hardware. For this reason, HP ProLiant servers track the rate of correctable errors through correctable error thresholding, which allows the system to differentiate between hard and soft errors. A soft error will not typically cause a DIMM to exceed HP’s correctable error threshold, whereas a hard error typically will. As a result, the user is warned about hard correctable errors but is not notified about soft errors, which do not indicate any issue with the hardware. HP suggests that corrective action be taken if a DIMM is receiving correctable errors at a rate higher than HP’s correctable error threshold rate. Even after a DIMM has exceeded the correctable error threshold, future errors will continue to be corrected, and the system will not shut down or crash due to additional correctable errors. However, a DIMM that is receiving correctable errors at a high rate has a higher probability of receiving an uncorrectable error, which would result in a system crash or shutdown on systems not configured for the Mirroring or RAID AMP modes.
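The thresholding idea can be illustrated with a short sketch. It is not HP's implementation: the threshold value, the time window, and the names `CorrectableErrorTracker` and `record` are assumptions chosen only to show how a rate-based threshold separates isolated soft errors from a burst of hard errors.

```python
from collections import defaultdict, deque


class CorrectableErrorTracker:
    """Count correctable errors per DIMM inside a sliding time window.

    max_errors and window_seconds are hypothetical values; the actual
    threshold used by the server firmware is not given in this paper.
    """

    def __init__(self, max_errors: int, window_seconds: float):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def record(self, dimm: str, timestamp: float) -> bool:
        """Log one correctable error; return True if the DIMM is now over the threshold."""
        events = self._events[dimm]
        events.append(timestamp)
        # Drop errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors


tracker = CorrectableErrorTracker(max_errors=3, window_seconds=3600.0)

# Two isolated (soft-like) errors far apart never exceed the threshold.
soft = [tracker.record("DIMM 1", t) for t in (0.0, 100000.0)]

# Repeated (hard-like) errors in quick succession exceed it on the fourth error.
hard = [tracker.record("DIMM 2", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

In this sketch a warning would be raised only for "DIMM 2", which mirrors the behavior described above: the user hears about a probable hard error but not about occasional soft errors.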
The user is warned about a DIMM exceeding the correctable error threshold in multiple ways. The system’s internal Health LED will indicate a caution condition. On most ProLiant 300-series servers, an LED next to the DIMM exceeding the threshold will be illuminated. In addition, if the System Management Driver and agents are loaded, a message will be logged to both the console and Systems Insight Manager. Correctable memory errors can typically be isolated to the actual failed DIMM.
While correctable errors do not affect the normal operation of the system, uncorrectable memory errors immediately result in a system crash or shutdown when the system is not configured for the Mirroring or RAID AMP modes. Uncorrectable errors are detected by ProLiant 300-series servers, but cannot be corrected. ProLiant 500-series and 700-series platforms with Mirroring or RAID AMP support are capable of protecting against uncorrectable memory errors. Uncorrectable errors are always multi-bit memory errors. For systems with Advanced ECC support, multi-bit memory errors within the same DRAM device on the DIMM are correctable. However, if multiple bits fail on different DRAM devices on a DIMM, the error will be uncorrectable. When a system receives an uncorrectable error and is not in an AMP mode that protects against these errors, the system will generate a non-maskable interrupt (NMI). The internal Health LED will indicate a critical condition, and on most systems, the LEDs next to the failed DIMMs will be illuminated. In addition, the error will be logged if the Systems Management Driver is loaded. In certain cases (typically when the failed memory is in the first bank of memory), the NMI handler will be unable to run because the memory where the NMI handler resides is corrupted. In these cases, the system will typically hard lock without any additional indication regarding the failure. Uncorrectable memory errors can typically be isolated only to a failed bank of DIMMs, rather than to the DIMM itself.

Protection from memory failures

There are six levels of protection from memory errors that are supported by HP. In this whitepaper, the focus will be on those levels of protection supported by the 300-series G4 class of servers. Each level of protection requires server support.
The base level of memory protection available is parity protection. All ProLiant 300-series platforms provide memory protection beyond that provided by parity. Parity can detect when a single-bit error occurs, but cannot correct it. When a single-bit error occurs on a system with parity protection, the system will hard lock and generate a non-maskable interrupt (NMI). Thus, single-bit errors are uncorrectable errors on a system with parity protection. In parity mode, there is no protection from any level of memory failure, because the ability to correct a failure does not exist.
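A minimal sketch of parity checking shows why it can only detect. The 8-bit word and the even-parity convention are assumptions made for illustration; the helper names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for an integer data word."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity


data = 0b10110010
stored = parity_bit(data)

one_flip = data ^ 0b00000100   # one bit changed in memory
two_flips = data ^ 0b00010100  # two bits changed in memory

single_detected = not check(one_flip, stored)
double_detected = not check(two_flips, stored)
```

The single flip is detected, but the parity bit says nothing about which bit changed, so the word cannot be repaired. The double flip leaves the parity unchanged and passes the check unnoticed.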
The next level of protection is Standard ECC. Standard ECC requires chipset and DIMM level support and provides the capability to detect and correct a single-bit error on a memory access. When a single-bit error occurs, the system will detect the error and correct the data. Thus, the system will continue to operate normally. With Standard ECC, all multi-bit memory errors will be detected, but not corrected. Multi-bit errors are uncorrectable and will result in a system crash and NMI.
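The behavior of single-bit correction with double-bit detection can be sketched with a textbook extended Hamming code over 4 data bits. This is an illustration only: the code actually used by the chipset and DIMMs is not described here and protects much wider words, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error cannot be corrected."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_ok = sum(c) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as one flipped bit at index `syndrome`
        # (index 0 is the overall parity bit itself) and flip it back.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


word = encode(0b1011)

single = list(word)
single[5] ^= 1          # one-bit error

double = list(word)
double[5] ^= 1          # two-bit error
double[6] ^= 1
```

Decoding `single` recovers the original data, while decoding `double` reports an uncorrectable error, which corresponds to the case where the system would crash and NMI.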
A more robust level of protection is provided by Advanced ECC, also known in the industry as “Chipkill.” Advanced ECC requires chipset and DIMM support and provides a higher level of protection than Standard ECC. Like Standard ECC, Advanced ECC will detect and correct single-bit errors. However, Advanced ECC will also detect and correct multi-bit errors if all failed bits are within a single DRAM device on the DIMM. An entire DRAM device on the DIMM can fail, and the system will continue to operate normally. If bits fail on multiple DRAM devices on the DIMM, the error cannot be corrected with Advanced ECC support, and the system will crash and NMI.
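The difference between Standard ECC and Advanced ECC can be summarized as a classification rule. The sketch below encodes only that rule, not the error-correcting code itself. The mapping of bit positions to DRAM devices (`BITS_PER_DEVICE = 4`, with bit i on device i // 4) is an assumption for illustration, and `classify` is a hypothetical helper.

```python
from typing import Iterable

# Assumption: each DRAM device supplies 4 adjacent bits of the memory word.
BITS_PER_DEVICE = 4


def classify(failed_bits: Iterable[int], advanced_ecc: bool) -> str:
    """Classify an error from the positions of the failed bits in one access."""
    bits = set(failed_bits)
    if not bits:
        return "no error"
    if len(bits) == 1:
        return "correctable"  # single-bit: corrected by Standard and Advanced ECC
    devices = {bit // BITS_PER_DEVICE for bit in bits}
    if advanced_ecc and len(devices) == 1:
        return "correctable"  # multi-bit, but confined to one DRAM device
    return "uncorrectable"


whole_device = [8, 9, 10, 11]   # every bit of one device failed
two_devices = [3, 4]            # adjacent bits that sit on different devices
```

With these inputs, `classify(whole_device, advanced_ecc=True)` is correctable, while the same failure under Standard ECC, or the `two_devices` failure under either mode, is uncorrectable.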
HP offers memory protection beyond those features listed above. ProLiant 300-series servers support Online Spare Mode. With Online Spare enabled, the system still takes advantage of Advanced ECC. In Online Spare Mode, one bank of memory is designated as the spare bank and is not counted toward total available system memory. If the correctable error threshold is exceeded by a DIMM in a particular bank of memory, that bank will be taken offline and the spare bank activated instead. Once the original bank is deactivated, the system will not utilize the memory that exhibited the failure. After switching to the spare bank of memory, the system will continue to monitor correctable threshold errors and log any failures. If an uncorrectable memory error occurs before or after the online spare switchover, the system will crash and NMI. However, the memory that exceeded the correctable threshold and was deactivated cannot result in an uncorrectable error once the online spare switchover is complete.
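The switchover behavior described above can be modeled as a small state machine. This is a sketch of the described behavior only, not of the firmware: the class name, method names, bank labels, and returned strings are all hypothetical.

```python
from typing import List, Optional


class OnlineSpareMemory:
    """Toy model of Online Spare Mode with one designated spare bank."""

    def __init__(self, banks: List[str], spare: str):
        self.active: List[str] = [b for b in banks if b != spare]
        self.spare: Optional[str] = spare
        self.offline: List[str] = []

    def on_threshold_exceeded(self, bank: str) -> str:
        """A DIMM in `bank` exceeded the correctable error threshold."""
        if bank not in self.active:
            return "ignored"
        if self.spare is None:
            # Spare already consumed: errors are still monitored and logged.
            return "logged only"
        self.active.remove(bank)
        self.offline.append(bank)
        self.active.append(self.spare)
        self.spare = None
        return "switched to spare"

    def on_uncorrectable_error(self, bank: str) -> str:
        """An uncorrectable error was reported for `bank`."""
        if bank not in self.active:
            return "ignored"  # offline memory is no longer accessed
        return "system crash (NMI)"


memory = OnlineSpareMemory(banks=["A", "B", "C"], spare="C")
first = memory.on_threshold_exceeded("A")
second = memory.on_threshold_exceeded("B")
```

After the first call, bank A is offline and bank C is in use. The second call only logs, because the single spare bank has already been activated.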

Benefits of online spare memory

With Online Spare Memory, degraded memory is automatically disengaged and a fresh set of memory is used in its place. This brings the reliability of the system to the pre-failure level without any service interruption and without compromising system availability.
This solution is beneficial to businesses that do not have a permanent IT staff, do not have replacement memory on hand, or cannot bring down the server for any reason until a scheduled downtime. If a memory module has reached its pre-defined threshold of correctable memory errors, its chance of encountering uncorrectable errors increases dramatically. Online Spare allows the system to automatically deactivate memory that is at a high risk of receiving an uncorrectable error and replace it with good memory, with no interruption to system operation. Because an uncorrectable error would result in a system crash and unscheduled downtime, Online Spare Mode decreases the chances of unscheduled downtime and system crashes due to uncorrectable memory errors.
Online Spare Memory is a higher level of memory protection that complements Advanced ECC support. Online Spare Memory is a user selectable option. Users can choose to disable Online Spare and make all installed memory available to the operating system and applications, or they can choose to enable Online Spare and reduce the amount of memory available to the OS and applications in return for a higher level of protection against uncorrectable memory errors. By default, Online Spare Mode is disabled.
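The capacity trade-off is simple arithmetic, sketched below. The bank labels and sizes are hypothetical, and the sketch does not model the configuration rules that govern which bank may serve as the spare.

```python
from typing import Dict, Optional


def available_memory_mb(bank_sizes_mb: Dict[str, int],
                        spare_bank: Optional[str] = None) -> int:
    """Memory visible to the OS; pass spare_bank to model Online Spare enabled."""
    total = sum(bank_sizes_mb.values())
    if spare_bank is None:
        return total  # Online Spare disabled: all installed memory is available
    return total - bank_sizes_mb[spare_bank]


banks = {"A": 2048, "B": 2048, "C": 2048}  # hypothetical 3-bank configuration

without_spare = available_memory_mb(banks)
with_spare = available_memory_mb(banks, spare_bank="C")
```

In this example the OS sees 6144 MB with Online Spare disabled and 4096 MB with bank C reserved as the spare.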

Deployment considerations

There are a few key factors to consider when determining what level of AMP support should be enabled:
What features are supported on the ProLiant server being deployed?
What level of protection is desired?
What is the cost of implementing the AMP mode?
To determine what AMP features are supported on your ProLiant server, refer to the Product QuickSpecs. The above sections detail the various protection modes and the benefits of each. The cost of implementation for Online Spare over Advanced ECC is the hardware cost of the extra DIMMs required for the spare bank. If Standard ECC or Advanced ECC is implemented, there is no cost associated with extra hardware.

Implementation differences between ProLiant 300 series G3 and G4 servers

The implementation of AMP support for G4 300-series ProLiant servers is very similar to the implementation on G3 servers with the following exceptions:
The configuration rules have changed with regard to dual-ranked DIMMs (see “Configuration rules,” below).