What is memory scrubbing?
Detailed description of memory scrubbing
Introduction
Advanced Memory Protection (AMP) consists of memory features that provide increased tolerance and
protection from memory failures. There are varying levels of AMP that are supported on ProLiant
servers, depending on the class of server. Refer to the product QuickSpecs for specific information on
the level of features supported on each ProLiant server.
AMP features include Advanced ECC, Online Spare Memory, Memory Mirroring and RAID.
Advanced ECC and Online Spare are supported on 300-series platforms. The focus of this
whitepaper is to detail Advanced ECC and Online Spare support for the 300-series platforms and
will cover how these features are enabled, the configuration rules for using these features, what
utilities can be used for monitoring failures, and how the failures can be repaired.
Memory failures defined
Memory failures vary in severity and in their impact on the state of the server.
Memory errors can be classified into correctable errors and uncorrectable errors.
Correctable errors can be detected and corrected if the chipset and DIMM support this functionality.
Error detection and correction is implemented by storing data and ECC bits on the DIMM. By utilizing
the data and ECC bits, the system can detect memory errors and correct certain types of failures.
Correctable errors are generally single-bit errors. All ProLiant 300-series servers are capable of
detecting and correcting single-bit errors. In addition, ProLiant servers with Advanced ECC support
can detect and correct some multi-bit errors. HP’s Advanced ECC allows detection and correction of
multi-bit failures if all failed bits are contained within a single DRAM device on the DIMM.
Correctable errors can be classified as “hard” and “soft” errors. With a hard error, every access to
the memory location will return an error. A hard error typically indicates a problem with the DIMM.
With a soft error, the data and/or ECC bits on the DIMM are incorrect, but the error will not continue
to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors are typically
caused by cosmic rays. They are rare but expected occurrences.
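The distinction between hard and soft errors can be illustrated with a small model. This is a conceptual sketch only: the `Cell` class, its `stuck_at` parameter, and the `classify` function are hypothetical names introduced for illustration, and the sketch does not represent how ProLiant servers distinguish the two cases (they use the correctable error thresholding described below). In a real system the expected value would come from the ECC logic.

```python
class Cell:
    """One stored bit. A cell with stuck_at set ignores writes (a hard fault)."""

    def __init__(self, value, stuck_at=None):
        self.stuck_at = stuck_at
        self.value = value if stuck_at is None else stuck_at

    def write(self, value):
        # A healthy cell stores the new value; a stuck cell keeps its stuck value.
        if self.stuck_at is None:
            self.value = value

    def read(self):
        return self.value


def classify(cell, expected):
    """Classify a cell as 'no error', 'soft', or 'hard' given the expected bit."""
    if cell.read() == expected:
        return "no error"
    cell.write(expected)  # write the corrected data back
    # A soft error disappears after the rewrite; a hard error returns on the next read.
    return "soft" if cell.read() == expected else "hard"


upset = Cell(1)
upset.value = 0               # simulate a transient bit flip
stuck = Cell(1, stuck_at=0)   # simulate a defective cell

soft_result = classify(upset, expected=1)
hard_result = classify(stuck, expected=1)
```

Here `soft_result` is `"soft"` because the rewrite repairs the cell, while `hard_result` is `"hard"` because the fault persists on every access.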
Although hard correctable memory errors are corrected by the system and will not result in system
downtime or data corruption, they indicate a problem with the hardware. On the other hand, soft
errors do not indicate any issue with the hardware. For this reason, HP ProLiant servers track the rate of
correctable errors through correctable error thresholding. This allows the system to differentiate
between hard and soft errors. A soft error will not typically cause a DIMM to exceed HP’s correctable
error threshold. On the other hand, a hard error will typically cause a DIMM to exceed HP’s
correctable error threshold. Due to HP’s correctable error thresholding, the user is warned about hard
correctable errors, but is not notified about soft errors which don’t indicate any issue with the
hardware. HP suggests that corrective action be taken if a DIMM is receiving correctable errors at a
rate higher than HP’s correctable error threshold rate. Even though a DIMM has exceeded the
correctable threshold, future errors will continue to be corrected. The system will not shut down or
crash due to additional correctable errors. However, a DIMM that is receiving correctable errors at a
high rate has a higher probability of receiving an uncorrectable error, which would result in a system
crash or shutdown for systems not configured for the Mirroring or RAID AMP modes.
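Rate-based thresholding of this kind can be sketched as a sliding-window counter per DIMM. This is a minimal illustration under stated assumptions: the class name, the `max_errors` and `window_seconds` parameters, and the values used are hypothetical, since this paper does not state HP's actual threshold or algorithm.

```python
from collections import defaultdict, deque


class CorrectableErrorTracker:
    """Counts correctable errors per DIMM inside a sliding time window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors      # hypothetical threshold
        self.window = window_seconds      # hypothetical window length
        self.events = defaultdict(deque)  # DIMM label -> recent error timestamps

    def record(self, dimm, timestamp):
        """Log one correctable error; return True if the DIMM exceeds the threshold."""
        recent = self.events[dimm]
        recent.append(timestamp)
        # Discard errors that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_errors


tracker = CorrectableErrorTracker(max_errors=3, window_seconds=60.0)

# Two isolated (soft-like) errors far apart in time stay under the threshold.
isolated = [tracker.record("DIMM 1", t) for t in (0.0, 1000.0)]

# A burst of errors on one DIMM (hard-like) crosses the threshold.
burst = [tracker.record("DIMM 2", float(t)) for t in range(5)]
```

In this sketch the isolated errors never trip the threshold, while the burst trips it on the fourth error, which mirrors how a rate threshold separates occasional soft errors from a persistently failing DIMM.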
The user is warned about a DIMM exceeding the correctable error threshold in multiple ways. The
system's internal Health LED will indicate a caution condition. On most ProLiant 300-series servers, an
LED next to the DIMM exceeding the threshold will be illuminated. In addition, if the System
Management Driver and agents are loaded, a message will be logged to both the console and
Systems Insight Manager. Correctable memory errors can typically be isolated to the actual failed
DIMM.
While correctable errors do not affect the normal operation of the system, uncorrectable memory
errors will immediately result in a system crash or shutdown of the system when not configured for
Mirroring or RAID AMP modes. Uncorrectable errors are detected by ProLiant 300-series servers, but
cannot be corrected. ProLiant 500-series and 700-series platforms with Mirroring or RAID AMP
support are capable of protecting against uncorrectable memory errors. Uncorrectable errors are
always multi-bit memory errors. For systems with Advanced ECC support, multi-bit memory errors
within the same DRAM device on the DIMM are correctable. However, if bits have failed
on different DRAM devices on a DIMM, the error will be uncorrectable. When a system receives an
uncorrectable error and is not in an AMP mode providing protection against these errors, the system
will generate an NMI. The internal Health LED will indicate a critical condition, and on most systems, the LEDs next
to the failed DIMMs will be illuminated. In addition, the error will be logged if the Systems
Management Driver is loaded. In certain cases (typically when the failed memory is in the first Bank of
memory), the NMI handler will be incapable of running because the memory where the NMI handler
resides will be corrupted. In these cases, the system will typically hard lock without any additional
indication regarding the failure. Uncorrectable memory errors can typically only be isolated down to
a failed Bank of DIMMs, rather than the DIMM itself.
Protection from memory failures
There are six levels of protection from memory errors that are supported by HP. In this whitepaper, the
focus will be on those levels of protection supported by the 300-series G4 class of servers. Each level
of protection requires server support.
The base level of memory protection available is parity protection. All ProLiant 300-series platforms
provide memory protection beyond that provided by parity. Parity can detect when a single-bit error
occurs, but cannot correct it. When a single-bit error occurs on a system with parity protection, the
system will halt with a non-maskable interrupt (NMI). Thus, single-bit errors are uncorrectable
errors on a system with parity protection. In parity mode, there is no protection from any level of
memory failures because the ability to correct the failure does not exist.
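The detect-only nature of parity can be shown with a short sketch. It assumes a single even-parity bit over an 8-bit word; the word width and the function names are illustrative choices, not details taken from any ProLiant implementation.

```python
def parity_bit(data):
    """Return the even-parity bit for an integer data word."""
    return bin(data).count("1") % 2


def check(data, stored_parity):
    """Return True if the word is consistent with its stored parity bit."""
    return parity_bit(data) == stored_parity


word = 0b10110010
stored = parity_bit(word)          # parity is stored alongside the data

single_flip = word ^ 0b00000100    # one bit flipped
double_flip = word ^ 0b00000110    # two bits flipped

single_detected = not check(single_flip, stored)
double_detected = not check(double_flip, stored)
```

The single-bit flip is detected, but the parity bit carries no information about which bit failed, so the error cannot be corrected. The sketch also shows a general property of single-bit parity: an even number of flipped bits leaves the parity unchanged and goes undetected.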
The next level of protection is Standard ECC. Standard ECC requires chipset and DIMM level support
and provides the capability to detect and correct a single-bit error on a memory access. When a
single-bit error occurs, the system will detect the error and correct the data. Thus, the system will
continue to operate normally. With Standard ECC, all multi-bit memory errors will be detected, but not
corrected. Multi-bit errors are uncorrectable and will result in a system crash and NMI.
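The single-error-correct, double-error-detect behavior described above can be sketched with a toy extended Hamming code over 4 data bits. This is an illustration of the principle only: the 8-bit codeword layout below is an assumption chosen for brevity, and real memory subsystems use much wider codes defined by the chipset.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword (a list of bits).

    Index 0 holds an overall parity bit; indices 1..7 form a Hamming(7,4)
    word with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c


def decode(codeword: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos        # XOR of the positions of all set bits
    overall = sum(c) % 2           # 0 when total parity is intact
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: a double-bit error.
        return "uncorrectable", None
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nibble


clean = encode(0b1011)

one_bit = list(clean)
one_bit[5] ^= 1                    # single-bit error

two_bit = list(clean)
two_bit[5] ^= 1
two_bit[6] ^= 1                    # double-bit error
```

Decoding `one_bit` recovers the original data, while decoding `two_bit` reports an uncorrectable error, which corresponds to the case where a Standard ECC system would crash with an NMI.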
A more robust level of protection is provided by Advanced ECC, also known in the industry as
“Chipkill.” Advanced ECC requires chipset and DIMM support and provides a higher level of
protection over Standard ECC. Like Standard ECC, Advanced ECC will detect and correct single-bit
errors. However, Advanced ECC will also detect and correct multi-bit errors if all failed bits are within
a single DRAM device on the DIMM. An entire DRAM device on the DIMM can be failed, and the
system will continue to operate normally. If bits fail on multiple
DRAM devices on the DIMM, the error cannot be corrected by Advanced ECC, and the
system will crash with an NMI.
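The difference in coverage between Standard ECC and Advanced ECC can be summarized as a classification rule. The sketch below is a model of that rule, not of the correction logic itself. It assumes, for illustration, that each DRAM device supplies 4 adjacent bits of the memory word (`device_width=4`); the actual bit-to-device mapping depends on the DIMM and chipset.

```python
def classify_standard_ecc(failed_bits):
    """Standard ECC: only a single failed bit is correctable."""
    if not failed_bits:
        return "no error"
    return "correctable" if len(failed_bits) == 1 else "uncorrectable"


def classify_advanced_ecc(failed_bits, device_width=4):
    """Advanced ECC: correctable when all failed bits lie in one DRAM device.

    device_width is a hypothetical number of word bits supplied per device.
    """
    if not failed_bits:
        return "no error"
    devices = {bit // device_width for bit in failed_bits}
    return "correctable" if len(devices) == 1 else "uncorrectable"


whole_device = {4, 5, 6, 7}   # every bit of one device has failed
two_devices = {3, 4}          # adjacent bits that fall in different devices
```

With these inputs, `whole_device` is uncorrectable under Standard ECC but correctable under Advanced ECC, while `two_devices` is uncorrectable under both.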
HP offers memory protection beyond those features listed above. ProLiant 300-series servers support
Online Spare Mode. With Online Spare enabled, the system still takes advantage of Advanced ECC.
In Online Spare Mode, one bank of memory is designated as the spare bank. In this mode, the
designated bank is not used for total available system memory. If the correctable error threshold is
exceeded by a DIMM in a particular bank of memory, that bank will be taken offline and the spare
bank activated instead. Once the original bank is deactivated, the system will not utilize the memory
that exhibited the failure. After switching to the spare bank of memory, the system will continue to
monitor correctable error thresholds and log any failures. If an uncorrectable memory error occurs
before or after the online spare switchover, the system will crash with an NMI. However, the memory
which exceeded the correctable threshold and was deactivated cannot result in an uncorrectable error
once the online spare switchover is complete.
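The switchover sequence can be sketched as a small state model. The class, method names, bank labels, and sizes are hypothetical, and the sketch assumes a single spare bank that is at least as large as any bank it replaces; it does not model the copy of data to the spare bank or any platform-specific configuration rules.

```python
class OnlineSpareMemory:
    """Tracks which banks are active and whether the spare is still unused."""

    def __init__(self, bank_sizes_mb, spare_bank):
        self.bank_sizes_mb = dict(bank_sizes_mb)
        self.spare_bank = spare_bank
        # The spare bank is excluded from the memory available to the OS.
        self.active = {bank for bank in bank_sizes_mb if bank != spare_bank}
        self.spare_available = True

    def available_mb(self):
        """Total memory in the banks currently in use."""
        return sum(self.bank_sizes_mb[bank] for bank in self.active)

    def on_threshold_exceeded(self, bank):
        """Swap a degraded bank for the spare; return True if a switch occurred."""
        if not self.spare_available or bank not in self.active:
            return False
        self.active.remove(bank)          # degraded bank is no longer used
        self.active.add(self.spare_bank)  # spare bank takes its place
        self.spare_available = False
        return True


memory = OnlineSpareMemory({"A": 1024, "B": 1024, "C": 1024}, spare_bank="C")
before_mb = memory.available_mb()
switched = memory.on_threshold_exceeded("A")
after_mb = memory.available_mb()
second_switch = memory.on_threshold_exceeded("B")
```

In this model the memory visible to the operating system is the same before and after the switchover, and a second threshold event cannot trigger another switch because the spare has already been consumed.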
Benefits of online spare memory
With Online Spare Memory, degraded memory is automatically disengaged and a fresh set of
memory is used in its place. This brings the reliability of the system to the pre-failure level without any
service interruption and without compromising system availability.
This solution is beneficial to businesses that do not have a permanent IT staff, do not have
replacement memory on hand, or cannot bring down the server for any reason until a scheduled
downtime. If a memory module has achieved its pre-defined threshold of correctable memory errors,
its chance of encountering uncorrectable errors increases dramatically. Online Spare allows the
system to automatically deactivate memory that is at a high risk of receiving an uncorrectable error,
and replace it with good memory. No interruption to system operation occurs. An uncorrectable error
would result in a system crash and unscheduled downtime. Thus, Online Spare Mode decreases the
chances of unscheduled downtime and system crashes due to uncorrectable memory errors.
Online Spare Memory is a higher level of memory protection that complements Advanced ECC
support. Online Spare Memory is a user selectable option. Users can choose to disable Online Spare
and make all installed memory available to the operating system and applications, or they can
choose to enable Online Spare and reduce the amount of memory available to the OS and
applications in return for a higher level of protection against uncorrectable memory errors. By default,
Online Spare Mode is disabled.
Deployment considerations
There are a few key factors to consider when determining what level of AMP support should be
enabled:
• What features are supported on the ProLiant server being deployed?
• What level of protection is desired?
• What is the cost of implementing the AMP mode?
To determine what AMP features are supported on your ProLiant server, refer to the Product
QuickSpecs. The above sections detail the various protection modes and the benefits of each. The
cost of implementation for Online Spare over Advanced ECC is the hardware cost of the extra DIMMs
required for the spare bank. If Standard ECC or Advanced ECC is implemented, there is no cost
associated with extra hardware.
Implementation differences between ProLiant 300-series G3 and G4 servers
The implementation of AMP support for G4 300-series ProLiant servers is very similar to the
implementation on G3 servers with the following exceptions:
• The configuration rules have changed with regard to dual-ranked DIMMs (see “Configuration rules,”
below).