Whitepaper
How to Balance Memory on 2nd Generation Intel® Xeon™ Scalable Processors
Optimizing Memory Performance

Matt Ogle, Technical Product Marketing, Dell EMC
Trent Bates, Product Management, Dell EMC
Bruce Wagner, Senior Principal Systems Development Engineer, Dell EMC
Rene Franco, Senior Principal Systems Development Engineer, Dell EMC

Abstract
Properly configuring a server with balanced memory is critical to ensure memory bandwidth is maximized and latency is minimized.
Table of Contents
1. Introduction
2. Memory Topography and Terminology
3. Memory Interleaving
4. Guidelines for Balancing Memory
5. Balanced Configurations (Recommended)
6. Near Balanced Configurations
7. Unbalanced Configurations
8. Conclusion
1. Introduction

Understanding the relationship between a server processor (CPU) and its memory subsystem is critical when optimizing overall server performance. Every processor generation has a unique architecture, with its own memory controllers, channels and slot population guidelines, that must be satisfied to attain high memory bandwidth and low memory access latency. Memory that has been incorrectly populated is referred to as an unbalanced or near balanced configuration.
2. Memory Topography and Terminology

Figure 1: PowerEdge R740 CPU-to-memory subsystem connectivity for Intel® Cascade Lake™

To understand the relationship between the CPU and memory, the terminology illustrated in Figure 1 must first be defined:
• The memory controllers are digital circuits that manage the flow of data between the computer's main memory and the corresponding memory channels. Intel® Xeon™ scalable processors have the memory controllers integrated into the CPU.
3. Memory Interleaving

Memory interleaving allows a CPU to efficiently spread memory accesses across multiple DIMMs. When memory is placed in the same interleave set, contiguous memory accesses go to different memory banks, so an access no longer has to wait for the prior access to complete before the next memory operation begins. To maximize performance, all DIMMs should belong to one interleave set, creating a single uniform memory region that is spread across as many DIMMs as possible.
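The round-robin behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not controller firmware logic; the 64-byte cache-line granularity is an assumption chosen for the example.

```python
# Illustrative sketch (not Dell/Intel code): round-robin interleaving of
# consecutive cache lines across the DIMMs in one interleave set.
CACHE_LINE = 64  # bytes per cache line (assumed granularity for the example)

def dimm_for_address(addr: int, num_dimms: int) -> int:
    """Return which DIMM in the interleave set serves this physical address."""
    line = addr // CACHE_LINE       # consecutive lines rotate across DIMMs
    return line % num_dimms

# Six consecutive cache lines land on six different DIMMs, so no access
# has to wait for the previous one to finish on the same module.
targets = [dimm_for_address(i * CACHE_LINE, 6) for i in range(6)]
print(targets)  # [0, 1, 2, 3, 4, 5]
```

With all twelve DIMMs in a single interleave set, the same rotation simply spans twice as many modules, which is why one uniform region is preferred over several smaller sets.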
4. Guidelines for Balancing Memory

4.1 Overview

To ensure effective interleaving, memory must be populated in a balanced configuration. Variables such as DIMM consistency and slot population dictate whether a configuration is balanced or unbalanced. At both the socket and server level, memory bandwidth is optimized when the guidelines below are implemented:
1. All memory modules inside the memory subsystem are identical
• They must have the same size, speed, rank count and DIMM type
2. All memory channels within a socket are populated identically
3. All CPU sockets within a server are configured identically
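These population rules can be expressed as a quick validity check. The helper below is hypothetical (not from the whitepaper); it assumes the six-channel, two-slots-per-channel (A/B) layout described in section 4.3, with each slot holding a DIMM description string or None when empty.

```python
# Hypothetical balance checker for one socket, following the guidelines above.
# slots_a / slots_b: one entry per channel (6 each); None means slot empty.
from typing import List, Optional

def is_balanced(slots_a: List[Optional[str]], slots_b: List[Optional[str]]) -> bool:
    dimms = [d for d in slots_a + slots_b if d is not None]
    if not dimms:
        return False
    # Guideline 1: every populated module is identical (size/speed/rank/type).
    if len(set(dimms)) != 1:
        return False
    # Identical channel population: all inner A slots filled first...
    if any(d is None for d in slots_a):
        return False
    # ...and the outer B slots either all filled or all empty.
    b_filled = [d is not None for d in slots_b]
    return all(b_filled) or not any(b_filled)

print(is_balanced(["16GB-2933"] * 6, [None] * 6))          # True: six identical DIMMs
print(is_balanced(["16GB-2933"] * 6, ["16GB-2933"] * 6))   # True: twelve identical DIMMs
print(is_balanced(["16GB-2933"] * 3 + [None] * 3, [None] * 6))  # False: partial channels
```

The same check would then be repeated per socket, since a balanced server requires every socket to pass with the same configuration.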
4.3 Identical Channel Population

Intel® Xeon™ scalable processors have six channels, identified in Figure 4 as one through six, with up to two slots per channel, identified as A and B. Optimal performance is achieved when a channel is completely populated. For example, in Figure 4 the inner grey slots A1, A2, A3, A4, A5 and A6 would all need to be populated to fulfill this guideline. The same applies to the outer black B slots, which may only be populated once the grey slots are filled.
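The fill order this guideline implies can be sketched as a small helper. The slot names follow Figure 4; the function itself is a hypothetical illustration, not a Dell configuration tool.

```python
# Hypothetical helper: the slot fill order implied by the guideline above,
# every inner grey A slot across channels 1-6 before any outer black B slot.
def population_order(num_dimms):
    a_slots = ["A%d" % ch for ch in range(1, 7)]  # inner grey slots, filled first
    b_slots = ["B%d" % ch for ch in range(1, 7)]  # outer black slots, filled last
    return (a_slots + b_slots)[:num_dimms]

print(population_order(6))  # ['A1', 'A2', 'A3', 'A4', 'A5', 'A6']
print(population_order(12))  # all A slots, then all B slots
```

Note that the order only guarantees the A-before-B rule; whether a given count of DIMMs yields a balanced configuration still depends on the guidelines in section 4.1.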
4.4 Identical Sockets

A CPU socket must have an identical memory subsystem meeting guidelines 4.2 and 4.3. A physical server should also have identically configured CPU sockets. As seen in Figure 5, when only one memory configuration exists across all sockets within a server, memory bandwidth is further optimized because the CPU and memory controllers can distribute data across many more channels.
5. Balanced Configurations (Recommended)

5.1 Traditional DIMMs

Memory controller logic was designed around having all memory slots populated to return the highest memory bandwidth, so it should come as no surprise that the top recommendation is a balanced configuration populated with twelve DIMMs.
5.2 Traditional DIMMs with DCPMMs

For mixed memory configurations containing both traditional DIMMs and persistent memory modules (DCPMMs), populating six traditional DIMMs into the first channel slots and six DCPMMs into the second channel slots reaps the highest memory bandwidth and lowest memory access latency. The benefits gained from the DCPMMs' persistence and increased capacity offset the memory bandwidth degradation introduced by having two interleave sets.
6. Near Balanced Configurations

Near balanced configurations also have only one interleave set. Populating DIMMs one, two and three in the same channel column naturally creates a single interleave set, but once four or more DIMMs are introduced, guideline 4.2 must be satisfied to maintain a near balanced configuration: mirrored columns must be identically populated with the same DIMMs. The four and eight DIMM illustrations below demonstrate what this looks like.
Figure 10: This configuration has one interleave set because identical memory modules are populated in the same channel column. The absence of all channels being populated makes this configuration near balanced.

Whitepaper – Balanced Memory with 2nd Generation Intel® Xeon Scalable Processors © 2019 Dell Inc. or its subsidiaries.
Figure 11: This configuration has one interleave set because identical memory modules are populated in the same channel column. The absence of all channels being populated makes this configuration near balanced.
Figure 12: This configuration has one interleave set because identical memory modules were distributed evenly across the mirrored channels. The absence of all channels being populated makes this configuration near balanced.

Figure 13: This configuration has one interleave set because identical memory modules were distributed evenly across both mirrored channels and columns. The absence of all channels and columns being populated makes this configuration near balanced.

7. Unbalanced Configurations
Figure 14: This configuration is unbalanced because the bottom left channel is populated while the bottom right channel is not. An additional, undesired interleave set has been introduced to accommodate the isolated memory module in the bottom left.
Figure 15: This configuration is unbalanced because the top left channel has both slots populated while the top right channel does not. An additional, undesired interleave set has been introduced to accommodate the isolated memory module in the top left.

Figure 16: This configuration is unbalanced because both left controller slots are fully populated with memory modules while the right controller slots are not.
Figure 17: This configuration is unbalanced because the top and middle channels are fully populated, while the bottom channels are only partially populated. An additional, undesired interleave set has been introduced to accommodate the isolated memory modules in the bottom, inner slots.
Figure 18: This configuration is unbalanced for two reasons. First, the top and middle channels are fully populated while the bottom channel is not. Second, the bottom left channel is fully populated with two DIMMs while the bottom right channel only has one. This has caused two additional, undesired interleave sets to be introduced to accommodate these isolated memory module groups.
8. Conclusion

Balancing memory with 2nd Generation Intel® Xeon™ scalable processors increases memory bandwidth and reduces memory access latency. If memory is populated in a near balanced or unbalanced configuration, memory bandwidth can be reduced by up to 33% from its maximum potential.
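As a rough illustration of where a figure of that magnitude can come from (an assumption for the example, not a measurement from this paper): if an unbalanced population leaves only four of a socket's six channels carrying interleaved traffic, peak bandwidth falls to 4/6 of its maximum.

```python
# Illustrative arithmetic only: peak bandwidth roughly scales with the number
# of channels actually interleaved, so 4 of 6 active channels loses a third.
channels_used, channels_total = 4, 6
reduction = 1 - channels_used / channels_total
print(f"{reduction:.0%}")  # prints "33%"
```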