Whitepaper
Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features
Revision: 1.2
Issue Date: 9/9/2020
Issue Date: 9/16/2020

Introduction
Memory sub-system errors are among the most common types of errors seen on modern computing systems. Understanding how memory errors occur and how to prevent or avoid them is a complex subject, one that has challenged countless industry researchers and developers over the last 30 years.
Revisions
Date Description
January 3, 2020
• Initial release
• Removed content for platforms based on AMD EPYC and Xeon E processors
• Added more information to primer on uncorrectable errors
• Added clarification on PPR resources for genuine Dell DIMMs
• Added MEM8000 SEL event to recommended user actions list
• Added clarification to MEM9072 SEL event details and recommended user action
• Added content specific to updates contained in BIOS 2.7.
Fred Spreeuwers, IPS Engineering, Technical Staff, Dell EMC
Mark Dykstra, IPS Engineering, Senior Principal Engineer, Dell EMC
Rene Franco, Memory Systems Engineering, Senior Manager, Dell EMC
Mark Farley, Component Quality Engineering, Senior Principal Engineer, Dell EMC

A Primer on Memory Errors
To fully understand the memory RAS response capabilities of PowerEdge servers, it is first helpful to understand the various types of possible memory errors.
Uncorrectable Errors (UCEs)
o Uncorrectable errors are multi-bit errors that could not be corrected by the server platform. These can be caused by any combination of soft or hard errors, but typically occur as a result of multiple hard errors.
o Not all multi-bit errors are uncorrectable. CPUs that support Advanced ECC can correct some types of multi-bit errors, depending on the bit error pattern.
Unconsumed | Poisoned upon detection; error waits to be consumed | Error waits to be consumed

A Primer on Dell EMC PowerEdge Server Memory RAS Capabilities
The memory errors discussed previously are mitigated through PowerEdge server memory RAS capabilities, which entail fault avoidance, detection, and correction in hardware and software. These RAS features are all intended to improve system reliability and extend uptime in the event of memory errors.
feature that is based on the concept of Single Symbol Correcting – Double Symbol Detecting (SSC-DSD) Reed-Solomon error correcting and detection code [3]. At a high level, SSC-DSD works by breaking up cache line accesses into ‘code words’ which in turn are made up of multi-bit symbols. The size of these symbols can vary depending upon the processor architecture.
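To make the symbol idea concrete, the following sketch classifies a bit-error pattern by how many symbols it touches. It is a toy model only: it does not implement Reed-Solomon encoding or decoding, and the 4-bit symbol width, the 1-based bit numbering, and the function names are assumptions chosen for illustration rather than values taken from any specific processor.

```python
def symbols_hit(error_bits, symbol_bits=4):
    """Return the set of symbol indices touched by the flipped bits.

    error_bits: iterable of 1-based bit positions within one code word.
    symbol_bits: assumed symbol width; the real width depends on the CPU.
    """
    return {(bit - 1) // symbol_bits for bit in error_bits}


def classify_ssc_dsd(error_bits, symbol_bits=4):
    """Classify an error pattern by the number of symbols it affects."""
    count = len(symbols_hit(error_bits, symbol_bits))
    if count == 0:
        return "no error"
    if count == 1:
        return "corrected"  # one symbol, any number of bad bits inside it
    if count == 2:
        return "detected, not corrected"
    return "beyond the SSC-DSD guarantee"


print(classify_ssc_dsd([5, 6, 7, 8]))  # four bad bits, all in one symbol
print(classify_ssc_dsd([4, 5]))        # two bad bits straddling two symbols
```

Under these assumptions, the number of failing bits matters less than how they are distributed: four bad bits confined to one symbol are correctable, while two bad bits that fall in different symbols are only detectable.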
Figure 2 - Advanced ECC can correct multi-bit errors in a single symbol…
Memory Configuration Required
• Two or more memory ranks per memory channel

Adaptive Double Device Data Correction (ADDDC) is an Intel platform-specific technology that allows two DRAM devices to fail sequentially before fault avoidance is lost. ADDDC is only supported with x4 DIMM populations and requires a memory configuration of two or more memory ranks per channel (two DIMMs per channel or a single DIMM with multiple ranks).
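The two configuration requirements stated above can be expressed as a simple per-channel check. This is a minimal sketch, not BIOS logic: the `Dimm` type and `adddc_config_ok` function are hypothetical names, and treating "x4 DIMM populations" as "every DIMM on the channel uses x4 devices" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    ranks: int         # number of ranks on the DIMM, e.g. 2 for "2Rx4"
    device_width: int  # DRAM device width, e.g. 4 for "2Rx4"


def adddc_config_ok(channel):
    """Check one memory channel against the two stated ADDDC requirements."""
    if not channel:
        return False
    all_x4 = all(d.device_width == 4 for d in channel)
    total_ranks = sum(d.ranks for d in channel)
    return all_x4 and total_ranks >= 2


print(adddc_config_ok([Dimm(2, 4)]))              # one dual-rank x4 DIMM
print(adddc_config_ok([Dimm(1, 4), Dimm(1, 4)]))  # two single-rank x4 DIMMs
print(adddc_config_ok([Dimm(1, 4)]))              # only one rank on the channel
print(adddc_config_ok([Dimm(2, 8)]))              # x8 devices
```

The first two channels satisfy both conditions; the last two fail on rank count and device width respectively.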
in the BIOS setup under the power management menu. Memory patrol scrub may have an impact on system performance for some workloads while it is running. FYI: Demand Scrub occurs when the memory controller encounters a correctable error during a regular run-time read transaction and writes back corrected data. Patrol Scrub is most useful when memory accesses are concentrated in some areas of memory, so that the remaining areas do not get the benefit of Demand Scrub.
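The difference between the two scrub types can be illustrated with a deliberately simplified set model. The region numbering, the error locations, and both function names are hypothetical; real scrubbing operates on cache lines in the memory controller, not on Python sets.

```python
def demand_scrub(latent_errors, accessed_regions):
    """Errors are corrected only in regions that the workload actually reads."""
    return latent_errors - set(accessed_regions)


def patrol_scrub(latent_errors, all_regions):
    """A full patrol pass visits every region regardless of workload access."""
    return latent_errors - set(all_regions)


regions = range(8)      # hypothetical memory regions 0..7
latent = {1, 5, 6}      # regions currently holding a correctable error
hot = [0, 1, 1, 0, 2]   # the workload reads only regions 0-2

after_demand = demand_scrub(latent, hot)
print(sorted(after_demand))                         # errors in regions never read
print(sorted(patrol_scrub(after_demand, regions)))  # nothing left after a full pass
```

In this model the errors in regions 5 and 6 survive demand scrub because the workload never reads them, and only the patrol pass clears them.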
scrubbing then seamlessly copy the contents of the degraded rank to the spare rank(s). Memory rank sparing is disabled by default and can be enabled in BIOS setup if required.
o E.g. One 32 GB RDIMM (2Rx4) and one 16 GB RDIMM (2Rx8) installed = two 16 GB ranks and two 8 GB ranks. Both 16 GB ranks will be held as spares, resulting in a capacity reduction of roughly 66% (32 GB of the 48 GB installed).
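The arithmetic in this example can be reproduced with a short calculation. The sketch assumes that the largest ranks are the ones reserved as spares, which matches the example but is not confirmed here as a general rule; the function name and the `spare_ranks` parameter are hypothetical.

```python
def sparing_capacity(rank_sizes_gb, spare_ranks):
    """Return (usable_gb, reduction_fraction) when the largest ranks are spares."""
    ordered = sorted(rank_sizes_gb, reverse=True)
    reserved = sum(ordered[:spare_ranks])  # capacity held back as spare ranks
    total = sum(ordered)
    return total - reserved, reserved / total


# Two 16 GB ranks (32 GB 2Rx4 RDIMM) plus two 8 GB ranks (16 GB 2Rx8 RDIMM).
usable, reduction = sparing_capacity([16, 16, 8, 8], spare_ranks=2)
print(usable, f"{reduction:.1%}")
```

For the population in the example, the sketch reports 16 GB usable and a reduction of about two-thirds, in line with the figure quoted above.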
Important: Consult your PowerEdge server installation and service manual for complete memory population guidelines to properly enable Memory Mirroring.
All memory channels must be populated identically, with either one DIMM per channel or two DIMMs per channel (for example, 24-DIMM systems should have 12 DIMMs or 24 DIMMs installed). Fault Resilient Memory is disabled by default and must be enabled through the BIOS setup menu. Important: Consult your PowerEdge server installation and service manual for complete memory population guidelines to properly enable Fault Resilient Memory.
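The population rule above reduces to a uniformity check across channels. The sketch below is illustrative only: the function name is hypothetical, the 12-channel layout is an assumed reading of the 24-DIMM example, and it does not replace the full population guidelines in the service manual.

```python
def uniform_population(dimms_per_channel):
    """True if every channel holds one DIMM, or every channel holds two."""
    counts = set(dimms_per_channel)
    return counts == {1} or counts == {2}


# A hypothetical 24-slot system with 12 channels of two slots each.
print(uniform_population([1] * 12))           # 12 DIMMs installed
print(uniform_population([2] * 12))           # 24 DIMMs installed
print(uniform_population([2] * 6 + [1] * 6))  # 18 DIMMs, mixed channels
```

Only the first two populations pass; the mixed 18-DIMM population does not meet the stated rule.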
Figure 7 - PPR for a row in a bank group of a 4Gb x4 device

PPR is always available on PowerEdge server platforms that support it and, if BIOS deems it necessary, executes automatically after a system cold reboot. For PPR to execute successfully, it is recommended that users do not swap or replace DIMMs between boots when receiving memory error event messages, unless instructed to do so by Dell technical support personnel.
• If the impacted data was in user/application/VM memory, then the OS will terminate the associated process or VM without impacting the rest of the system.
• If the impacted data was in user/application/VM memory but the OS had a redundant copy of the data, then the associated process or VM will recover.

Consult your operating system documentation on error containment for more information on OS behaviors.
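The two user-memory cases described above can be summarized as a small decision function. This is a sketch of the described outcomes, not of any operating system's actual implementation; the function name, the `region` labels, and the return strings are hypothetical, and any case other than user/application/VM memory is deliberately left unhandled.

```python
def os_response(region, redundant_copy=False):
    """Map the location of impacted data to the outcome described above.

    region: where the impacted data lived; only "user" (user, application,
    or VM memory) is covered by this sketch.
    """
    if region != "user":
        return "not covered by this sketch"
    if redundant_copy:
        return "process or VM recovers"
    return "process or VM terminated; rest of system unaffected"


print(os_response("user"))
print(os_response("user", redundant_copy=True))
```

Actual behavior depends on the operating system and its error containment support, so the mapping should be read as a summary of the two cases rather than a specification.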
o Benefit: Patrol scrub will run every four hours (instead of every 24); the increased frequency reduces the accumulation of errors in areas of memory that have low utilization and are therefore not being corrected by demand scrub.

It is also recommended that users keep their PowerEdge server firmware up to date, especially the server BIOS. This is because even after products ship, PowerEdge server development continuously works to improve its RAS algorithms and behaviors for an optimal customer experience.
location (note that BIOS may initiate more reboots during this process). Do not remove or swap the DIMM at the location specified in the event message.
• MEM0804 – This is an indication that the system has successfully performed memory self-healing at the DIMM location specified in the event message.
o Recommended Response Action: No response required. DIMM is operating nominally.
• PowerEdge XR2
• PowerEdge R440
• PowerEdge R540
• PowerEdge R640
• PowerEdge R740
• PowerEdge R740xd
• PowerEdge R740xd2
• PowerEdge R840
• PowerEdge R940
• PowerEdge R940xa
• PowerEdge FC640
• PowerEdge M640
• PowerEdge MX740c
• PowerEdge MX840c

PowerEdge servers with Xeon E and AMD EPYC processors are not covered in this whitepaper. Customers with these servers should continue to refer to v1.0 of the RAS whitepaper.

What’s New in BIOS 2.8.
the CE rate detection scheme associated with the MEM8000 event. However, the improvement resulted in an uptick in MEM8000 events that were not substantiated by results from memory component failure analysis.
Legal Notices THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. Copyright © 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel and Xeon are trademarks of Intel Corporation or its subsidiaries.