Technical white paper
Memory RAS technologies for
HPE ProLiant and HPE Synergy Gen10 Plus Servers with Intel Xeon Scalable processors
Contents
Introduction
Why is memory RAS needed?
Memory RAS technologies in HPE ProLiant and HPE Synergy servers
HPE Fast Fault Tolerance
Advanced ECC support
Mirrored memory with advanced ECC support
Memory scrubbing (patrol and demand)
Conclusion
Resource
Memory device failures, if not corrected, can result in service events or even server crashes. As modern servers implement ever-larger memory arrays, the likelihood of a memory device failure increases with memory capacity. Because memory device failures are among the most frequent types of server failure, besides storage failures, HPE ProLiant Gen10 Plus Servers using Intel® Xeon® Scalable processors provide an increasingly comprehensive suite of memory reliability, availability, and serviceability (RAS) features split into the following categories:
•Error detection and correction
•Redundancy and resiliency
•Maintenance
This paper provides a quick overview of selected memory RAS technologies for HPE ProLiant Gen10 Plus Servers, their characteristics, minimum requirements, and how to enable them. The information presented will help you select the most appropriate memory RAS technologies to meet your demanding workload and data center service-level requirements, especially for business-critical workloads.
Note
This paper focuses solely on server memory RAS features. It does not review the comprehensive suite of other RAS technologies found throughout the HPE ProLiant and HPE Synergy portfolios.
Server uptime is still one of the most critical aspects of data center maintenance. Unfortunately, servers can run into trouble from time to time due to software issues, power outages, or memory errors. The three major categories of memory errors we track and manage are correctable errors, uncorrectable errors, and recoverable errors. Whether an error is correctable or uncorrectable depends entirely on the capability of the memory controller.
Correctable errors are, by definition, errors that can be detected and corrected by the chipset. Correctable errors are single-bit errors. All HPE servers can detect and correct single-bit errors with advanced error-correcting code (ECC) support. On HPE systems, the user is warned when a DIMM exceeds the correctable error threshold (the maximum number of correctable errors tolerated in a certain time window), either through lights on the front panel or system board (if available) or through the HPE Integrated Management Log (IML).
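The idea of a correctable error threshold over a time window can be sketched as a sliding-window counter. This is only an illustration of the concept: the class name, the threshold of 3 errors, and the 60-second window are hypothetical values chosen for the example, not the values or the algorithm HPE firmware uses.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Counts correctable errors per DIMM inside a sliding time window.

    max_errors and window_seconds are hypothetical illustration values.
    """

    def __init__(self, max_errors: int, window_seconds: float):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = {}  # DIMM label -> deque of error timestamps

    def record(self, dimm: str, timestamp: float) -> bool:
        """Record one correctable error; return True if the DIMM is over threshold."""
        events = self._events.setdefault(dimm, deque())
        events.append(timestamp)
        # Discard errors that have aged out of the time window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors


monitor = CorrectableErrorMonitor(max_errors=3, window_seconds=60.0)
# Three errors within the window stay under the threshold; the fourth exceeds it.
results = [monitor.record("DIMM 1", t) for t in (0.0, 10.0, 20.0, 30.0)]
# A later, isolated error falls in a fresh window and does not trigger a warning.
late_result = monitor.record("DIMM 1", 100.0)
```

A burst of errors in a short period trips the warning, while the same number of errors spread over a long period does not, which matches the time-window wording above.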
Uncorrectable errors are errors that can be detected but not corrected by the chipset. These are always multibit memory errors. The error is logged in the HPE IML. Uncorrectable errors can typically be isolated down to a single DIMM, and they usually result in an immediate system crash or shutdown. In some cases, with OS support and advanced SKU processors (Intel® Xeon® Platinum and Intel® Xeon® Gold processors), uncorrectable errors do not result in a system crash. We call these recoverable errors. For error recovery details, check with your OS vendor.
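The difference between a correctable single-bit error and a detectable but uncorrectable multibit error can be illustrated with a textbook single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming(8,4) code on a 4-bit value. It is a teaching example only: it is not the ECC scheme used by the memory controllers discussed in this paper, and the function names are made up for the illustration.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4)
    # Data bits live at the positions that are not powers of two.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is ok, corrected, or uncorrectable."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of set positions is 0 for a valid word
    parity_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd parity: a single bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


word = encode(0b1011)
single = list(word)
single[5] ^= 1                 # one flipped bit: corrected
double = list(single)
double[2] ^= 1                 # a second flipped bit: detected only
outcomes = (decode(word), decode(single), decode(double))
```

In this toy code, one flipped bit is repaired transparently, while two flipped bits are reported but cannot be repaired, which mirrors the correctable and uncorrectable categories described above.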
DRAM errors come in two types: hard errors and soft errors.
•Hard errors typically indicate a problem with the DIMM itself. Although hard correctable errors are corrected by the system and will not result in system downtime or data corruption, they still indicate a hardware problem. Hard errors typically cause a DIMM to exceed HPE systems’ correctable error threshold. The user is warned about those errors.
•Soft errors do not indicate any issues with the DIMM. They occur when the data and/or ECC bits on the DIMM are incorrect, but the error will not continue to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors will not typically cause a DIMM to exceed HPE systems’ correctable error threshold and therefore, no indication of a hardware issue is shown.
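The distinction between the two types can be sketched as a write-back-and-reread check: if the error disappears once the corrected data is rewritten, it behaves like a soft error, and if it comes back, it behaves like a hard error. The toy memory cell below, including its `stuck_mask` defect model and the function name, is a hypothetical illustration and not how HPE firmware classifies errors.

```python
class Cell:
    """One toy memory location.

    stuck_mask models bits that a hardware defect forces to 1 on every read.
    """

    def __init__(self, stuck_mask: int = 0):
        self.stuck_mask = stuck_mask
        self.value = 0

    def write(self, value: int) -> None:
        self.value = value

    def read(self) -> int:
        return self.value | self.stuck_mask


def classify_error(cell: Cell, corrected_value: int) -> str:
    """Rewrite the corrected data, read it back, and report 'soft' or 'hard'."""
    cell.write(corrected_value)
    return "soft" if cell.read() == corrected_value else "hard"


# Soft error: a transient bit flip that vanishes after the data is rewritten.
healthy = Cell()
healthy.value = 0b1010 ^ 0b0100
soft_result = classify_error(healthy, 0b1010)

# Hard error: a stuck bit that reappears on every read after the rewrite.
defective = Cell(stuck_mask=0b0100)
hard_result = classify_error(defective, 0b1010)
```

A single check like this is data dependent (a stuck-at-1 bit is invisible when the stored bit is already 1), which is one reason a threshold over many errors is a more robust indicator than any single event.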
Any kind of error, if not handled correctly, can eventually cause a system shutdown. In the early days of servers, basic ECC was sufficient to resolve most DRAM failures. However, today's servers present a completely different challenge, so additional RAS features are necessary to maintain expected server stability and uptime. By keeping a memory fault from becoming a critical failure, these features avoid a system crash, and failed memory devices can then be replaced as part of periodic service. Also, memory RAS technologies can detect a DRAM device on a DIMM that has had numerous soft errors and recommend replacing it before it has a hard failure.
The following descriptions provide an overview of the functionality of selected memory RAS technologies.
HPE Fast Fault Tolerance is a new HPE memory RAS feature first introduced in HPE ProLiant Gen10 Plus Servers with Intel Xeon Scalable processors. Those servers configured with HPE SmartMemory and HPE Fast Fault Tolerance offer an extra layer of protection against planned server downtime and server crashes. HPE Fast Fault Tolerance, an enhanced version of adaptive double device data correction (ADDDC), is a result of a Hewlett Packard Enterprise and Intel® collaboration. HPE Fast Fault Tolerance has more spare regions (part of memory allocated only for replacing bad memory areas) and more options to map out bad sections of memory. This results in significantly better memory reliability and availability than what the rest of the industry can provide using ADDDC only.
In past server generations, the most advanced memory protection technology in HPE ProLiant servers was double device data correction (DDDC). Its biggest drawback was that it had to be enabled at boot, and it significantly reduced memory throughput when enabled. Customers had to choose between resiliency and performance. HPE Fast Fault Tolerance provides a significant improvement over DDDC because it combines the performance benefits of single device data correction (SDDC) with the availability of DDDC. HPE Fast Fault Tolerance allows the system to boot with full memory performance and puts only small sections (banks) of memory into lockstep when needed to correct failures, resulting in significantly better performance than DDDC. When the failing section is larger than a bank, a larger negative impact on performance may be observed.
•HPE Fast Fault Tolerance survives up to two DRAM failures by detecting and correcting them.
•The RAS feature combines the resiliency of DDDC with the performance of SDDC.
There must be a minimum of one rank on each populated channel. Furthermore, only HPE SmartMemory in x4 organization can be used.
HPE Fast Fault Tolerance is enabled by default for all workload profiles except for the low latency profile.
HPE Fast Fault Tolerance can be enabled or disabled on any HPE Gen10 Plus server through the ROM-Based Setup Utility (RBSU) or the HPE RESTful API. To change the default setting in a workload profile, first select the desired workload profile and then change the profile to Custom. At that point, HPE Fast Fault Tolerance may be enabled or disabled through the Memory Options > Advanced Memory Protection menu.
HPE Fast Fault Tolerance configuration requirements vary for each server series, but it does not require OS support or special software beyond the basic input/output system (BIOS).
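For readers who script the RESTful API route, the sketch below builds (but does not send) a Redfish-style PATCH request that sets the workload profile to Custom and selects the memory protection mode. The resource path, the attribute names `WorkloadProfile` and `AdvancedMemProtection`, the value `FastFaultTolerantADDDC`, and the host name are all assumptions made for this illustration. Verify them against the BIOS attribute registry of the target server before use. Authentication and TLS handling are intentionally left out.

```python
import json
import urllib.request

# Assumed names and values: confirm each one in the server's BIOS attribute registry.
BIOS_SETTINGS_PATH = "/redfish/v1/systems/1/bios/settings/"
ATTR_WORKLOAD_PROFILE = "WorkloadProfile"
ATTR_MEMORY_PROTECTION = "AdvancedMemProtection"
VALUE_FAST_FAULT_TOLERANCE = "FastFaultTolerantADDDC"


def build_fft_request(management_host: str) -> urllib.request.Request:
    """Build a PATCH request that would enable HPE Fast Fault Tolerance.

    The request is only constructed here; nothing is sent over the network.
    """
    body = {
        "Attributes": {
            # The profile is set to Custom so the memory setting can be changed.
            ATTR_WORKLOAD_PROFILE: "Custom",
            ATTR_MEMORY_PROTECTION: VALUE_FAST_FAULT_TOLERANCE,
        }
    }
    return urllib.request.Request(
        url=f"https://{management_host}{BIOS_SETTINGS_PATH}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# "ilo.example.com" is a placeholder host name.
request = build_fft_request("ilo.example.com")
payload = json.loads(request.data)
```

Building the request separately from sending it makes it easy to review the exact payload first, which is useful because BIOS changes of this kind generally take effect only after a reboot.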
Figure 1. HPE Fast Fault Tolerance enabled in the RBSU
If a DRAM device fails, throughput is reduced only in the affected region of memory, which is typically small (most commonly a single bank). No significant loss is expected for random-access memory patterns because the region in lockstep will be accessed infrequently. The loss can be significant with rank-level virtual lockstep, or if an application accesses the affected region frequently before the DIMM is replaced. The overall reduction in throughput from HPE Fast Fault Tolerance is expected to be minimal for the vast majority of customers but does depend on the application, the size of the affected region, and the memory configuration.
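The dependence on access pattern and region size can be made concrete with a back-of-the-envelope model. The sketch assumes that a fraction of all memory accesses lands in the lockstep region and that each such access costs a fixed multiple of a normal access. Both parameters, and the slowdown factor of 2 used below, are hypothetical illustration values, not measured HPE figures.

```python
def relative_throughput(lockstep_access_fraction: float, lockstep_slowdown: float) -> float:
    """Estimate throughput relative to a system with no region in lockstep.

    lockstep_access_fraction: share of accesses that hit the lockstep region (0..1).
    lockstep_slowdown: cost of a lockstep access relative to a normal one (>= 1).
    """
    if not 0.0 <= lockstep_access_fraction <= 1.0:
        raise ValueError("lockstep_access_fraction must be between 0 and 1")
    if lockstep_slowdown < 1.0:
        raise ValueError("lockstep_slowdown must be at least 1")
    # Average cost per access, with a normal access costing 1 unit.
    average_cost = (1.0 - lockstep_access_fraction) + lockstep_access_fraction * lockstep_slowdown
    return 1.0 / average_cost


# Uniform random access where the lockstep region is 1/1024 of memory.
random_access = relative_throughput(1 / 1024, 2.0)
# An application that sends half of its accesses to the affected region.
hot_region = relative_throughput(0.5, 2.0)
```

Under these assumptions, uniform random access over a small region loses a fraction of a percent, while an application that concentrates on the affected region loses about a third, which is consistent with the qualitative statement above.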