Technical white paper
Memory RAS technologies for
HPE ProLiant and HPE Synergy Gen10 Plus Servers with Intel Xeon Scalable processors
Contents
Introduction
Why is memory RAS needed?
Memory RAS technologies in HPE ProLiant and HPE Synergy servers
  HPE Fast Fault Tolerance
  Advanced ECC support
  Mirrored memory with advanced ECC support
  Memory scrubbing (patrol and demand)
Conclusion
Resource
Memory device failures, if not corrected, can result in service events or even server crashes. As modern servers implement ever-larger memory arrays, the likelihood of a memory device failure increases with memory capacity. Since memory device failures are among the most frequent types of server failure besides storage failures, HPE ProLiant Gen10 Plus Servers using Intel® Xeon® Scalable processors provide an increasingly comprehensive suite of memory reliability, availability, and serviceability (RAS) features split into the following categories:
•Error detection and correction
•Redundancy and resiliency
•Maintenance
This paper provides a quick overview of selected memory RAS technologies for HPE ProLiant Gen10 Plus Servers, their characteristics, minimum requirements, and how to enable them. The information presented will help you select the most appropriate memory RAS technologies to meet your demanding workload and data center service-level requirements, especially for business-critical workloads.
Note
This paper focuses solely on server memory RAS features. It does not review the comprehensive suite of other RAS technologies found throughout the HPE ProLiant and HPE Synergy portfolios.
Server uptime is still one of the most critical aspects of data center maintenance. Unfortunately, servers can run into trouble from time to time due to software issues, power outages, or memory errors. The three major categories of memory errors we track and manage include correctable errors, uncorrectable errors, and recoverable errors. The determination of which errors are correctable and uncorrectable is completely dependent on the capability of the memory controller.
Correctable errors are, by definition, errors that can be detected and corrected by the chipset. Correctable errors are single-bit errors. All HPE servers can detect and correct single-bit errors with advanced error-correcting code (ECC) support. On HPE systems, the user is warned about a DIMM exceeding the correctable error threshold (the maximum number of correctable errors tolerated in a certain time window) either through lights on the front panel or system board (if available) or through the HPE Integrated Management Log (IML).
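The idea of a correctable error threshold over a time window can be illustrated with a minimal sketch. The class name, the threshold, and the window length below are hypothetical and chosen only for illustration; they do not reflect HPE firmware values or its actual implementation.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Counts correctable errors per DIMM inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold            # hypothetical value, not an HPE default
        self.window_seconds = window_seconds  # hypothetical value, not an HPE default
        self._events = {}                     # DIMM label -> deque of error timestamps

    def record(self, dimm: str, timestamp: float) -> bool:
        """Record one correctable error; return True if the DIMM exceeds the threshold."""
        events = self._events.setdefault(dimm, deque())
        events.append(timestamp)
        # Drop errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


monitor = CorrectableErrorMonitor(threshold=3, window_seconds=3600)
results = [monitor.record("DIMM-1", t) for t in (0, 10, 20, 30)]
later = monitor.record("DIMM-1", 4000)  # earlier errors have aged out by now
```

In this sketch, only the fourth error inside the same hour pushes the DIMM over the threshold, while an isolated error much later does not, which mirrors the way occasional soft errors stay below the warning level.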
Uncorrectable errors are errors that can be detected but not corrected by the chipset. These are always multibit memory errors. The error is logged in the HPE IML. Uncorrectable errors can typically be isolated down to a single DIMM and usually result in an immediate system crash or shutdown. In some cases, with OS support and advanced SKU processors (Intel® Xeon® Platinum and Intel® Xeon® Gold processors), uncorrectable errors do not result in a system crash. We call these recoverable errors. For error recovery details, check with your OS vendor.
DRAM errors come in two different types — hard errors and soft errors.
•Hard errors typically indicate a problem with the DIMM itself. Although hard correctable errors are corrected by the system and will not result in system downtime or data corruption, they still indicate a hardware problem. Hard errors typically cause a DIMM to exceed HPE systems’ correctable error threshold. The user is warned about those errors.
•Soft errors do not indicate any issues with the DIMM. They occur when the data and/or ECC bits on the DIMM are incorrect, but the error will not continue to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors will not typically cause a DIMM to exceed HPE systems’ correctable error threshold and therefore, no indication of a hardware issue is shown.
Any kind of error, if not handled correctly, can eventually cause a system shutdown. In the early days of servers, basic ECC was sufficient to resolve most DRAM failures. However, today’s servers present a completely different challenge, so additional RAS features are necessary to maintain expected server stability and uptime. By avoiding a critical failure, these features avoid a system crash, and failed memory devices can be replaced as part of periodic service. Also, memory RAS technologies can detect a DRAM device on a DIMM that has had numerous soft errors and recommend replacing it before it has a hard failure.
The following descriptions provide an overview of the functionality of selected memory RAS technologies.
HPE Fast Fault Tolerance is a new HPE memory RAS feature first introduced in HPE ProLiant Gen10 Plus Servers with Intel Xeon Scalable processors. Those servers configured with HPE SmartMemory and HPE Fast Fault Tolerance offer an extra layer of protection against planned server downtime and server crashes. HPE Fast Fault Tolerance, an enhanced version of adaptive double device data correction (ADDDC), is a result of a Hewlett Packard Enterprise and Intel® collaboration. HPE Fast Fault Tolerance has more spare regions (part of memory allocated only for replacing bad memory areas) and more options to map out bad sections of memory. This results in significantly better memory reliability and availability than what the rest of the industry can provide using ADDDC only.
In past server generations, the most advanced memory protection technology in HPE ProLiant servers was double device data correction (DDDC). Its biggest drawback was that it had to be enabled at boot, and it significantly reduced memory throughput when enabled. Customers had to choose between resiliency and performance. HPE Fast Fault Tolerance provides a significant improvement over DDDC because it combines the performance benefits of single device data correction (SDDC) with the availability of DDDC. HPE Fast Fault Tolerance allows the system to boot with full memory performance and puts only small sections (banks) of memory into lockstep when needed to correct failures, resulting in significantly better performance than DDDC. When the failing section is larger than a bank, a larger negative impact on performance may be observed.
•HPE Fast Fault Tolerance survives up to two DRAM failures (detect and correct).
•The RAS feature combines the resiliency of DDDC with the performance of SDDC.
There must be a minimum of one rank on each populated channel. Furthermore, only HPE SmartMemory in x4 organization can be used.
HPE Fast Fault Tolerance is enabled by default for all workload profiles except for the low latency profile.
HPE Fast Fault Tolerance can be enabled or disabled on any HPE Gen10 Plus server through the RBSU or the HPE RESTful API. To change the default setting in the workload profile, first the desired workload profile should be selected and then changed to Custom. At that point, HPE Fast Fault Tolerance may be enabled or disabled accordingly through the Memory Options — Advanced Memory Protection menu.
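As a rough illustration of the RESTful route, the sketch below only builds the request that such a change might use; it does not contact a server. The resource path, the attribute names (`WorkloadProfile`, `AdvancedMemProtection`), and the attribute values are assumptions that should be checked against the BIOS attribute registry of the specific server before use. The helper name is hypothetical.

```python
import json

# Every name below is an assumption to verify against the server's own
# BIOS attribute registry; none of them is confirmed by this paper.
BIOS_SETTINGS_PATH = "/redfish/v1/systems/1/bios/settings/"
PROTECTION_MODES = {
    "fast_fault_tolerance": "FastFaultTolerantADDDC",
    "advanced_ecc": "AdvancedEcc",
}


def build_memory_protection_patch(mode: str) -> dict:
    """Describe the PATCH request that would select a memory protection mode."""
    if mode not in PROTECTION_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "method": "PATCH",
        "path": BIOS_SETTINGS_PATH,
        "body": {
            "Attributes": {
                # The text above says the workload profile is changed to Custom first.
                "WorkloadProfile": "Custom",
                "AdvancedMemProtection": PROTECTION_MODES[mode],
            }
        },
    }


request = build_memory_protection_patch("fast_fault_tolerance")
print(json.dumps(request, indent=2))
```

The request description can then be handed to whichever HTTP client or management tool is in use. BIOS changes made this way are typically staged as pending settings and applied at the next reboot.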
HPE Fast Fault Tolerance configuration requirements vary for each server series, but it does not require OS support or special software beyond the basic input/output system (BIOS).
Figure 1. HPE Fast Fault Tolerance enabled in the RBSU
There will also be a minimal performance reduction in throughput if a DRAM fails but only in the typically small region (most common size is a bank) of memory that is affected. No significant loss is expected for random-access memory patterns because the region in lockstep will be accessed infrequently. The loss can be significant if you have rank level virtual lockstep or if an application accesses the region frequently until the DIMM is replaced. The overall reduction in throughput from HPE Fast Fault Tolerance is expected to be minimal for the vast majority of customers but does depend on the application, the size of the affected region, and the memory configuration.
Standard ECC can correct single-bit memory errors and detect multibit memory errors. When multibit errors are detected using standard ECC, the error is signaled to the server and causes the server to halt.
Advanced ECC has been the default error correction scheme in HPE servers for over two decades. It not only protects servers against single-bit errors, but it also protects against some multibit memory errors — specifically those that occur within a single DRAM chip.
Advanced ECC can correct both single-bit memory errors and 4-bit memory errors if all failed bits are on the same DRAM device on the DIMM. Advanced ECC provides more protection than standard ECC because it is possible to correct certain memory errors that would otherwise be uncorrected and result in a server failure. Using HPE advanced memory error detection technology, the server provides notification when a DIMM is degrading and has a higher probability of an uncorrectable memory error.
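The rule that a multibit error is correctable only when all failed bits sit on one DRAM device can be sketched as a simple check. This is an illustration of the condition only, not of the ECC code itself. The mapping of four consecutive data bits to one x4 device and the function names are assumptions made for the example.

```python
from typing import Iterable

BITS_PER_DEVICE = 4  # x4 DRAM organization: each device supplies 4 bits (assumed layout)


def device_of(bit_index: int) -> int:
    """Map a bit position to the DRAM device assumed to supply it."""
    return bit_index // BITS_PER_DEVICE


def correctable_by_advanced_ecc(failed_bits: Iterable[int]) -> bool:
    """True when every failed bit falls within a single DRAM device."""
    devices = {device_of(bit) for bit in failed_bits}
    # Zero devices means no error; one device covers the single-bit
    # and same-device multibit cases described in the text.
    return len(devices) <= 1


single_bit = correctable_by_advanced_ecc([5])            # one bit, one device
same_device = correctable_by_advanced_ecc([4, 5, 6, 7])  # four bits, one device
two_devices = correctable_by_advanced_ecc([3, 4])        # two bits, two devices
```

Under these assumptions, the first two patterns are correctable, while the third spans two devices and falls outside what advanced ECC alone is described as correcting.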
There are no specific memory population rules or RBSU settings required for advanced ECC support. It is enabled by default on Intel Xeon Scalable processor platforms.
Advanced ECC support is the default advanced memory protection mode in RBSU > Memory Options, as shown in Figure 2.
Figure 2. Advanced ECC support, an RBSU default feature
Although advanced ECC provides failure protection, it can reliably correct multibit errors only when they occur within a single DRAM chip. Advanced ECC does not provide failover capability. As a result, if there is a memory failure, the system must be shut down before the memory can be replaced. The latest generation of HPE ProLiant and HPE Synergy servers using Intel Xeon Scalable processors offers three levels of advanced memory protection (including HPE Fast Fault Tolerance) that provide increased fault tolerance for applications requiring higher levels of availability.
Mirrored memory with advanced ECC support provides protection against uncorrectable errors that would otherwise result in system failure. There are two modes available — fully and partially mirrored memory support.
•Full mirrored memory support uses half of the system memory capacity to maintain one copy of all data.
•Partial mirrored memory support enables the user to assign a smaller amount of the system memory for mirroring. This feature is supported with advanced CPU SKUs for Intel Xeon Platinum and Gold processors.
If an uncorrectable error occurs in the mirrored memory protected region, the system automatically retrieves the good data from the redundant copy. The system continues to operate normally without any user intervention. By providing added redundancy in the memory subsystem, memory mirroring provides the greatest protection against memory failure not corrected by ECC, SDDC, DDDC, ADDDC, and online spare memory.
When full mirrored memory is enabled, only half the populated memory is usable as system memory. Since full memory mirroring consumes 50% of the system memory capacity, it is targeted at server workloads that must receive the highest level of protection from memory device failures. You might want to consider memory mirroring for workloads that cannot have downtime and cannot risk waiting until scheduled downtime to replace degraded memory modules.
Partial memory mirroring can be configured by the customer and supports different modes:
•OS configured
•First 4 GB of server memory
•10% or 20% of memory above 4 GB
For more information on partial memory mirroring support, check with your OS vendor for details.
The performance impact for implementing memory mirroring is generally small. Because partial memory mirroring uses less memory, the cost of implementation can be significantly lower than full memory mirroring.
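The capacity trade-off between full and partial mirroring can be worked through with a short calculation. This sketch assumes that a mirrored region of a given size consumes the same amount again for its copy, and it reads the "10% or 20% of memory above 4 GB" mode as mirroring only that share of the installed capacity above 4 GB. Both points are simplifying assumptions, and the function names are hypothetical.

```python
def usable_capacity_gb(installed_gb: float, mirrored_gb: float) -> float:
    """Capacity left for the OS when mirrored_gb of memory is protected by a copy."""
    if mirrored_gb < 0 or 2 * mirrored_gb > installed_gb:
        raise ValueError("mirrored region and its copy must fit in installed memory")
    return installed_gb - mirrored_gb  # the copy consumes mirrored_gb of capacity


def full_mirror_usable_gb(installed_gb: float) -> float:
    """Full mirroring: half of the installed memory holds the copy."""
    return usable_capacity_gb(installed_gb, installed_gb / 2)


def percent_above_4gb_usable_gb(installed_gb: float, percent: float) -> float:
    """Partial mirroring of a percentage of the memory above 4 GB (assumed reading)."""
    mirrored = (installed_gb - 4) * percent / 100
    return usable_capacity_gb(installed_gb, mirrored)


full = full_mirror_usable_gb(64)                # 32.0 GB usable
partial_20 = percent_above_4gb_usable_gb(64, 20)  # 52.0 GB usable
partial_10 = percent_above_4gb_usable_gb(64, 10)  # 58.0 GB usable
```

For a 64 GB configuration under these assumptions, full mirroring leaves 32 GB usable, while the 20% and 10% partial modes leave 52 GB and 58 GB, which is the cost difference the paragraph above refers to.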
The third generation of the Intel Xeon Scalable processor family, used in HPE Gen10 Plus servers, supports four memory controllers per processor. Each memory controller supports two memory channels. When memory mirroring is enabled, the two channels attached to each memory controller become a mirrored pair. To enable mirroring, the two channels of each pair must be populated identically. If DIMMs are populated on multiple channel pairs, the population of each pair can differ from the others, as long as the population is legal. Note that nonhomogeneous populations will have performance implications.
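The pairing rule can be expressed as a small validation sketch. The data layout (a mapping from memory controller number to the DIMM population of its two channels) and the DIMM description strings are hypothetical; the check covers only the "identical population within a pair" rule and not the platform's full population rules.

```python
from typing import Dict, List, Tuple

# DIMM descriptions per slot on one channel, for example ("32GB-2R-x4",).
ChannelPopulation = Tuple[str, ...]


def unmirrorable_controllers(
    population: Dict[int, Tuple[ChannelPopulation, ChannelPopulation]]
) -> List[int]:
    """Return the controllers whose two channels are not populated identically."""
    return sorted(
        controller
        for controller, (channel_a, channel_b) in population.items()
        if channel_a != channel_b
    )


processor = {
    0: (("32GB-2R-x4",), ("32GB-2R-x4",)),                            # identical pair
    1: (("64GB-2R-x4", "64GB-2R-x4"), ("64GB-2R-x4", "64GB-2R-x4")),  # differs from pair 0, still fine
    2: (("32GB-2R-x4",), ("64GB-2R-x4",)),                            # mismatch inside the pair
    3: ((), ()),                                                      # unpopulated pair
}
problems = unmirrorable_controllers(processor)  # [2]
```

Controller 1 is populated differently from controller 0 without violating the rule, because only the two channels inside each pair have to match.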
Figure 3 shows memory mirroring diagrams for HPE ProLiant Gen10 Plus Servers.
Figure 3. Memory mirroring diagrams for HPE ProLiant Gen10 Plus Servers
Partial memory mirroring follows the same loading rules as full mirroring supported on the platform.
Mirrored memory support can be enabled in the RBSU by configuring the advanced memory protection option to mirrored memory with advanced ECC. For full mirrored memory, the customer designates half of the memory banks as system memory and the remaining banks as mirrored memory. All banks must be configured identically.
Figure 4 shows the mirrored memory feature enabled in the RBSU.
Figure 4. Mirrored memory feature — enabled in the RBSU
To configure partial memory mirroring, the advanced memory protection option has to be set to mirrored memory with advanced ECC and the memory mirroring mode to the appropriate setting as shown in Figure 5.
Figure 5. Partial memory mirroring — an advanced feature enabled on RBSU
Note
Partial Mirror (OS configured) is supported only by some operating systems. Check with your OS vendor for details.
Memory scrubbing is a standard RAS memory feature designed to prevent soft errors from accumulating in memory and eventually becoming an uncorrected error. It does this by proactively writing correct data back to memory every time an error is detected. There are two types of scrubbing in today’s systems — patrol scrubbing and demand scrubbing. Both do the same thing; once an error is found, they correct it in memory. The significant difference is how the error is found. Patrol scrubbing is more of a proactive search for errors occurring continuously in the background while demand scrubbing occurs only when memory is read by the OS or the application.
When patrol scrubbing is enabled, it proactively searches the system memory for correctable errors and repairs them. This prevents single-bit errors from accumulating until they exceed the correctable error threshold or degrade into uncorrectable multibit errors. There is one patrol scrubber per integrated memory controller (IMC).
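The difference between the two scrubbing types can be illustrated with a toy model. This is a deliberately simplified sketch: it assumes an ECC that corrects exactly one flipped bit per line and treats anything more as uncorrectable, and the class and method names are hypothetical. Real memory controllers work differently in detail.

```python
class ToyMemory:
    """Toy model in which each line tracks how many of its bits are flipped."""

    def __init__(self, lines: int):
        self.flipped = [0] * lines
        self.uncorrectable = 0

    def inject_soft_error(self, line: int) -> None:
        self.flipped[line] += 1

    def _check_and_scrub(self, line: int) -> None:
        if self.flipped[line] == 1:
            self.flipped[line] = 0   # correct the data and write it back
        elif self.flipped[line] > 1:
            self.uncorrectable += 1  # beyond what this toy ECC can fix

    def demand_read(self, line: int) -> None:
        """Demand scrubbing: errors are found only on lines that are read."""
        self._check_and_scrub(line)

    def patrol_pass(self) -> None:
        """Patrol scrubbing: every line is visited in the background."""
        for line in range(len(self.flipped)):
            self._check_and_scrub(line)


with_patrol = ToyMemory(16)
with_patrol.inject_soft_error(7)
with_patrol.patrol_pass()         # first error is repaired before a second arrives
with_patrol.inject_soft_error(7)
with_patrol.demand_read(7)        # still a single-bit error, so it is corrected

without_patrol = ToyMemory(16)
without_patrol.inject_soft_error(7)
without_patrol.inject_soft_error(7)  # line was never read in between
without_patrol.demand_read(7)        # two flipped bits: uncorrectable in this model
```

In the first run, the patrol pass clears the rarely read line before a second error lands on it. In the second run, the two errors accumulate and the eventual read finds an uncorrectable pattern.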
There are no specific memory population rules or RBSU settings required for patrol scrubbing. It is enabled by default on Intel Xeon Scalable processor platforms and can be turned off by the user. Demand scrubbing is always enabled and cannot be turned off.
Patrol scrubbing is enabled by default for any advanced memory protection mode selected in RBSU > Memory Options.
Figure 6 shows patrol scrubbing being enabled in RBSU.
Figure 6. Patrol scrubbing — enabled in RBSU
The BIOS enables the patrol scrubbing engine during boot and sets up the scrubbing interval. The scrubbing action involves:
•Reading every cache line once a day to check for errors.
•Writing the correct data back to memory if errors are found.
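To give a sense of scale for reading every cache line once a day, the following sketch estimates the resulting scrub rate. The 64-byte cache line size and the even split of memory across controllers are assumptions, and the function name is hypothetical; the actual scrubbing interval is set by the BIOS.

```python
CACHE_LINE_BYTES = 64            # assumed cache line size
SECONDS_PER_DAY = 24 * 60 * 60


def patrol_scrub_rate(capacity_gib: int, controllers: int = 1) -> dict:
    """Estimate the rate needed to read every cache line once per day."""
    if capacity_gib <= 0 or controllers <= 0:
        raise ValueError("capacity and controller count must be positive")
    total_bytes = capacity_gib * 2**30
    total_lines = total_bytes // CACHE_LINE_BYTES
    return {
        "total_lines": total_lines,
        # One patrol scrubber per controller, memory assumed evenly split.
        "lines_per_second_per_controller": total_lines / SECONDS_PER_DAY / controllers,
        "bytes_per_second_total": total_bytes / SECONDS_PER_DAY,
    }


rate = patrol_scrub_rate(capacity_gib=512, controllers=4)
```

For 512 GiB spread over four controllers, this works out to roughly 25,000 cache lines per second per controller, or a little over 6 MB/s in total, which is small next to the bandwidth of a memory channel.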
Patrol scrubs are intended to ensure that correctable errors do not remain in DRAM long enough to have a significant chance of combining with a transient error to cause an uncorrectable error. Patrol scrubbing works in all memory RAS modes, such as advanced ECC, mirroring, or rank sparing and helps reduce uncorrectable events.
The demand for servers with more memory capacity is unrelenting. It is driven by increasingly complex and memory-intensive applications and more powerful processors. While meeting the demand for more system memory, the challenge for server manufacturers is to maintain the reliability of the memory system even though there is a higher probability of memory errors as memory densities and capacities climb.
HPE is meeting the challenge with fault-tolerant memory protection technologies such as online spare memory, mirrored memory, and HPE Fast Fault Tolerance. Online spare memory is beneficial to customers with sites that cannot afford downtime from memory errors and yet can wait until a scheduled downtime to replace failed memory modules. Mirrored memory provides a higher level of availability, with a more fault-tolerant option providing full protection against single-bit and multibit errors. HPE Fast Fault Tolerance, the latest technology introduced in HPE ProLiant and HPE Synergy Gen10 Plus Servers using Intel Xeon Scalable processors, delivers significantly better memory reliability and availability to the customer.
These HPE advanced memory protection technologies enable customers to choose a system with the level of memory availability they prefer to enhance the robustness of their final solution.
• HPE servers library
Learn more at
HPE.com/info/memory
© Copyright 2023 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.
Intel, Intel Xeon, Intel Xeon Gold, and Intel Xeon Platinum are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. All third-party marks are property of their respective owners.
a50004620ENW, Rev. 1