Design Principles for HPC
Design principles for scalable, multi-rack HPC systems
Dell EMC HPC Engineering
January 2018
A Dell EMC Technical White Paper
Revisions
Date          Description
January 2018  Initial release
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © January 2018 Dell Inc. or its subsidiaries.
Table of contents
Revisions
Executive summary
1 Introduction
Executive summary
HPC system configuration can be a complex task, especially at scale, requiring a balance between user requirements, performance targets, data center power, cooling and space constraints, and price. Furthermore, the many choices among competing technologies complicate configuration decisions. This document describes a modular approach to HPC system design at scale, in which sub-systems are broken down into modules that integrate well together.
1 Introduction
A high performance computing (HPC) system comprises many different components, both hardware and software, that are selected, configured and tuned to optimize performance, energy efficiency and interoperability. The number of different components, and the choices available for each component, can make HPC design complex and intractable, especially as systems scale to support growing performance requirements. This white paper describes the principles for HPC design used at Dell EMC.
Figure 1 – End-to-end HPC system view
2 Hardware Components
The use case of the HPC system is the starting point. The requirements of the applications that the system is expected to support, the number of users, data storage and processing requirements, and the size, power, cooling and weight restrictions of the datacenter all determine the size of the HPC system. This document does not discuss how to size an HPC system for a particular set of applications, but instead focuses on the aspects common to all HPC systems.
Infrastructure servers are key to keeping the HPC system up and accessible to users. The Dell EMC PowerEdge R640 or R740 Server is recommended for this role. These are 1U and 2U servers, respectively, with rich configuration choices in memory and local storage, and both provide Integrated Dell Remote Access Controller (iDRAC) enterprise manageability and systems security.
These servers are part of Dell EMC’s 14th generation server line-up and include next generation systems management with Integrated Dell Remote Access Controller 9 (iDRAC9), expanded interconnect and disk options including Non-Volatile Memory Express (NVMe) support, and support for the Intel® Xeon® Scalable Processor Family or AMD EPYC processors. Due to its density and feature set, the PowerEdge C6420 Server is the most popular HPC compute platform and is described in more detail in Appendix A.
2.4 Networking Options
Most HPC systems have at least two network fabrics. One is used for administration and management, and the second for inter-process communication (IPC) and storage traffic. Depending on the use case, separate networks for management and storage traffic may also be configured. The choices for the network range from Ethernet (1 GbE, 10 GbE, 25 GbE, 40 GbE or 100 GbE) to Mellanox InfiniBand and Dell EMC H-Series fabrics based on Intel® Omni-Path Architecture (OPA).
Figure 3 - Example rack configuration with 72 compute nodes, GbE and OPA switches. The OPA switches shown here could alternately be distributed between the servers to reduce intra-rack cable lengths and cable congestion at the top of the rack.
A popular network design for HPC uses top-of-rack (TOR) or leaf switches within the rack, and core switches that connect to the leaf switches. Section 2.2 on compute units mentioned a building block unit that includes servers and a switch.
are assumed to be in a separate rack, say along with the infrastructure nodes. Multiple such racks can be configured, and the uplinks from all these racks can be combined at the core switches as shown in Figure 4 and Figure 5. Both these schematics use the 48-port Omni-Path switch as an example, but the same principles can be applied to InfiniBand with the 36-port switch as well. Note that for larger clusters or more complex networks a custom network topology can be designed for the specific use case.
The number of compute, infrastructure and storage switch ports and required blocking factor of the fabric will determine the total number of switches needed for a specific configuration. The PowerEdge line of server platforms allows not just multiple interconnect choices but also support for multiple generations of interconnects. For example, EDR InfiniBand is an option today, but when HDR InfiniBand is available in the market, the same server platforms will support HDR as well.
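To illustrate how port counts and the blocking factor drive switch counts, here is a minimal sizing sketch for a two-tier leaf/core fabric. The node count, 48-port switch radix and 2:1 blocking factor used in the example are illustrative assumptions only; the function name and structure are hypothetical, not a Dell EMC sizing tool.

```python
# Rough two-tier leaf/core fabric sizing sketch (illustrative assumptions only).
import math

def size_fabric(total_nodes, switch_ports=48, blocking=2.0):
    """Estimate leaf and core switch counts for a two-tier fabric.

    blocking = downlinks per uplink (1.0 means non-blocking).
    """
    # Split each leaf switch's ports between node downlinks and core uplinks.
    downlinks = int(switch_ports * blocking / (blocking + 1))
    uplinks = switch_ports - downlinks

    leaf_switches = math.ceil(total_nodes / downlinks)
    total_uplinks = leaf_switches * uplinks
    # Each core switch can terminate 'switch_ports' uplinks; this estimate
    # ignores how uplinks are balanced across the core switches.
    core_switches = math.ceil(total_uplinks / switch_ports)
    return downlinks, uplinks, leaf_switches, core_switches

if __name__ == "__main__":
    # Example: 288 compute nodes, 48-port switches, 2:1 blocking.
    down, up, leaves, cores = size_fabric(288)
    print(f"{down} downlinks / {up} uplinks per leaf, "
          f"{leaves} leaf and {cores} core switches")
```

For the example values this reports 32 downlinks and 16 uplinks per leaf, 9 leaf switches and 3 core switches; a real design would also balance the uplinks evenly across the core switches.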
3 Software Components
This section describes some of the software components that make up the HPC system. As with the hardware, the software is assembled in a modular manner. There are multiple choices for individual components (such as the resource manager, MPI library and compilers) and support for drivers based on the hardware selections. The software components are validated on in-house HPC systems along with the hardware, as mentioned in Section 1.1.
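As a minimal illustration of exercising one such component choice, the sketch below assumes an MPI library and the mpi4py Python bindings are part of the installed stack; it simply reports which node each rank runs on. This is a generic sanity check, not part of the validated Dell EMC software stack.

```python
# Minimal MPI sanity check, assuming an MPI library and the mpi4py bindings
# are available in the installed software stack (illustrative only).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
node = MPI.Get_processor_name()

# Each rank reports where it landed; after a fresh deployment this confirms
# the launcher (e.g. mpirun or the resource manager) and fabric are wired up.
print(f"rank {rank} of {size} on {node}")
```

Launched, for example, with mpirun -np 4 python mpi_check.py (the script name is hypothetical), all ranks should report in; missing ranks or unexpected node names usually point to launcher or fabric configuration issues.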
results are available for many categories of applications, including digital manufacturing, life sciences and research. Work is also being done on containers and deep learning. Tests include performance as well as power monitoring to provide effective energy efficiency recommendations. The in-house expertise and partnerships with ISVs lead to detailed best practices, and HPC systems can then be configured taking into account each customer’s specific use case and priorities.
4 Data Center Considerations
The final configuration of the HPC system is a series of trade-offs. An existing datacenter provides many of the bounds for the system: the physical limitations of space and the “per rack” limitations of power, cooling, and weight. These topics are discussed below.
4.1 Power Configuration
A trade-off with high-density servers is the input power requirement of each rack.
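As a rough illustration of this power trade-off, the sketch below estimates total rack draw from per-node figures and compares it with the power available to the rack. The wattages and the 25 kW budget are hypothetical placeholders, not measured values for any specific Dell EMC platform.

```python
# Rough per-rack power budget check (hypothetical numbers, not measurements).
def rack_power_kw(servers, watts_per_server, switches=0, watts_per_switch=150):
    """Estimate total rack draw in kW from per-node and per-switch power."""
    return (servers * watts_per_server + switches * watts_per_switch) / 1000.0

if __name__ == "__main__":
    budget_kw = 25.0                      # what the datacenter can deliver per rack
    draw = rack_power_kw(servers=72, watts_per_server=550, switches=5)
    print(f"Estimated draw: {draw:.1f} kW against a {budget_kw} kW budget")
    if draw > budget_kw:
        # Reduce density until the rack fits the available power.
        max_servers = int((budget_kw * 1000 - 5 * 150) / 550)
        print(f"Over budget; reduce to about {max_servers} servers per rack")
```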
4.2 Rack Cooling
Delivering adequate power to a rack for maximum performance without throttling is one factor that determines the density of infrastructure; the other is dissipating the heat generated so that operations continue within the required ambient temperature limits. Focusing first on the server itself, the hardware configuration of the platform determines the ambient temperature needs of the server.
deployments, smaller rack manifolds could also be mounted horizontally within the U-space. Depending on the power capacity that must be managed, a heat exchanger could be 2U or 4U mounted in the rack, or a standalone unit in its own rack.
Figure 7 - PowerEdge C6420 liquid cooled server sled
Figure 8 – Rack manifold and Heat Exchanger
4.3 Rack Weight
A compute rack, as described in Figure 3 with 72 servers and five top-of-rack switches, will weigh ~1800 lbs. and consume up to 43 kW of power.
weight of the rack. In these cases, the trade-off will be to reduce server density to acceptable power and weight limits. For example, assuming a datacenter weight limit of 1500 lbs. per rack and 25 kW of power per rack, a reasonable configuration would be 10 chassis (40 PowerEdge C6420 servers) in the rack with two top-of-rack Omni-Path switches and one Ethernet switch. This rack would contain 23U of equipment and satisfy both the weight and maximum power limits.
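The same reasoning can be generalized: given per-chassis weight and power estimates, find the densest configuration that respects both datacenter limits. The per-chassis and fixed overhead numbers below are illustrative assumptions chosen to roughly match the ballpark figures above, not platform specifications.

```python
# Find the densest rack configuration that respects weight and power limits.
# All per-chassis and fixed-overhead numbers are illustrative assumptions.
def max_chassis(weight_limit_lbs, power_limit_kw,
                lbs_per_chassis=85.0, kw_per_chassis=2.2,
                fixed_lbs=300.0, fixed_kw=1.0):
    """fixed_* covers the rack, PDUs and switches; each chassis holds 4 sleds."""
    by_weight = int((weight_limit_lbs - fixed_lbs) / lbs_per_chassis)
    by_power = int((power_limit_kw - fixed_kw) / kw_per_chassis)
    return max(0, min(by_weight, by_power))

if __name__ == "__main__":
    chassis = max_chassis(weight_limit_lbs=1500, power_limit_kw=25)
    print(f"{chassis} chassis -> {chassis * 4} servers, "
          f"{chassis * 2}U of compute")
```

With the 1500 lbs. and 25 kW limits from the example above, this yields 10 chassis, i.e. 40 servers, matching the configuration described in the text.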
5 Scaling the Design
HPC systems are scale-out by definition. As needs grow, additional compute power can be added with more CPU-, GPU- or FPGA-based servers, and additional storage capacity or throughput can be added with more storage arrays. The network must support this expansion. Ideally, the architecture of the system should accommodate this need to scale from the design phase. The design principles presented in this paper make scaling easy.
6 Enterprise Support and Services
A good design and a well-configured HPC system are the starting point for the end users; equally important are services and support for the entire life cycle of the system. An HPC-trained services team can rack and stack, install, and complete the software configuration of the system if desired.
7 Conclusion
This document presents Dell EMC’s HPC design principles. HPC systems are configured based on a careful design that covers all aspects from hardware, software and application performance to data center considerations such as power and cooling. Extensive evaluation in the HPC and AI Innovation Lab leads to best practice recommendations, including performance characteristics across multiple HPC domains.
A Appendix – Dell EMC PowerEdge C6420 Server
The PowerEdge C6420 Server is the most popular HPC compute platform and is described in more detail here. Four server sleds are hosted in a 2U chassis as shown in Figure 11. The chassis provides shared infrastructure for power and cooling. Each individual sled is a standard two-socket server platform with individual networking, as seen in Figure 12.
Figure 13 - PowerEdge C6400 chassis with 24 2.5" disks (front view)
Some key features of this platform are provided in Table 1 and Table 2.
Table 1 – PowerEdge C6420 chassis
Chassis: PowerEdge C6400, 2U chassis
Chassis options for disks and backplane: 3.5” disks, up to 12 disk drives in chassis; or 2.5” disks, up to 24 disk drives in chassis
Disk Controller: PERC 9 (H330, H730P), HBA330 and chipset RAID controller
Hard disk drives: With 4 server sleds in a chassis, up to 6 2.5” drives per server with up to 2 as NVMe, or up to 3 3.5” drives per server; with 2 server sleds in a chassis, up to 12 2.5” drives per server
PCI-e slots and network options: On board 1GbE RJ45 connector; 1 x16 PCI-e slot; 1 x8 mezzanine slot for internal storage controller; 1 x16 mezzanine slot for network cards; 1 x16 buried riser for M.