Dell HPC Lustre Storage
A Dell Technical White Paper
Quy Ta
Dell HPC Engineering Innovations Lab
September 2016 | Version 1.
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND. © 2016 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Contents
1. Introduction
2. The Lustre File System
3. Dell HPC Lustre Storage with Intel EE for Lustre Description
3.1 Management Server
1. Introduction
In high performance computing (HPC), the efficient delivery of data to and from the compute nodes is critical and often complicated to execute. Researchers can generate and consume data in HPC systems at such speed that the storage components become a major bottleneck. Getting maximum performance for their applications requires a scalable storage solution.
pairs. Each additional OSS increases the existing networking throughput, while each additional OST increases the storage capacity. Figure 1 shows the relationship of the MDS, MDT, MGS, OSS and OST components of a typical Lustre configuration. Clients in the figure are the HPC cluster’s compute nodes.
Object Storage Target (OST) – Stores the data stripes or extents of the files on a file system.
Object Storage Server (OSS) – Manages the OSTs, providing Lustre clients access to the data.
Lustre Clients – Access the MDS to determine where files are located, then access the OSSs to read and write data.
Typically, Lustre configurations and deployments are considered very complex and time-consuming tasks.
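For reference, this interaction can be observed from any Lustre client with the standard lfs utility; the mount point /mnt/lustre and the file name used below are assumptions and should be replaced with the actual mount point and file of interest:

lfs df -h /mnt/lustre                # capacity and usage reported per MDT and per OST
lfs getstripe /mnt/lustre/somefile   # shows which OSTs hold the stripes of a given file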
Figure 2: Dell HPC Lustre Storage Solution Components Overview
There are several architectural changes in this release compared to the previous release. The solution continues to use the Dell PowerEdge R630 server as the Intel Management Server, while the Object Storage Servers and Metadata Servers in the configuration are based on the Dell PowerEdge R730.
Figure 3: Dell PowerEdge R730
3.1 Management Server
The Intel Manager Server is a single server connected to the Metadata servers and Object Storage servers via an internal 1GbE network.
Figure 4: Metadata Server Pair (SAS cabling map for ieel3-mds1 and ieel3-mds2 to the MD3420 array, listing server, SAS PCI slot, SAS port, MD3420 array, and MD3420 controller/port assignments)
Figure 5: Metadata Server Pair with Lustre DNE option (SAS cabling map for ieel3-mds1 and ieel3-mds2 to MD3420 arrays #1 and #2, listing server, SAS PCI slot, SAS port, MD3420 array, and MD3420 controller assignments)
Figure 6: Object Storage Server Pair (SAS cabling map for ieel3-oss1 and ieel3-oss2 to MD3460 arrays #1 through #4, listing server, SAS PCI slot, SAS port, and MD3460 array assignments)
Targets per enclosure. By using RAID 6, the solution provides higher reliability at a marginal cost in write performance (due to the extra set of parity data required by each RAID 6). Each OST provides about 29TB of formatted object storage space when populated with 4TB HDDs. With the Dell HPC Lustre Storage solution, each MD3460 provides 6 OSTs. The OSTs are exposed to clients with LNet via InfiniBand EDR connections.
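As a rough illustration of where the 29TB figure comes from (the 8+2 RAID 6 geometry below is an assumption used only for the arithmetic): with six OSTs carved out of a 60-drive MD3460, each OST spans 10 drives in RAID 6, leaving 8 data drives per OST.

8 data drives × 4 TB ≈ 32 TB raw ≈ 29 TiB of formatted object storage per OST, after the decimal-to-binary conversion and file system overhead.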
Figure 8: OSS Scalability (diagram showing the Intel Manager for Lustre, the metadata server pair with its MGT and MDT, additional object storage server pairs with their OSTs, the management network, and compute clients on a high performance data network such as Intel Omni-Path, InfiniBand, or 40/10GbE)
Scaling the Dell HPC Lustre Storage can be achieved by adding additional OSS pairs with storage backend, as demonstrated in Figure 8.
interface using IPoIB (i.e. ifcfg-ib0) as well as your 10GbE Ethernet interface (i.e. eth0) on your OSS servers so that both participate in the Lustre Network. With the InfiniBand EDR fabric, fast transfer speeds with low latency can be achieved. LNet leverages the RDMA capabilities of the InfiniBand fabric to provide faster I/O transport and lower latency compared to typical networking protocols.
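As a sketch only (the interface names ib0 and eth0 and the file path are assumptions that must match the actual devices on the OSS nodes), both networks can be declared through the lnet module options and then verified:

options lnet networks="o2ib0(ib0),tcp0(eth0)"    # placed in /etc/modprobe.d/lustre.conf
lctl list_nids                                   # confirms the o2ib and tcp NIDs after the module is reloaded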
4. Performance Evaluation and Analysis
The performance studies presented in this paper profile the capabilities of the Dell HPC Lustre Storage with Mellanox InfiniBand EDR in a 240-drive configuration. The configuration has 240 4TB disk drives (960TB of raw space). The goal is to quantify the capabilities of the solution, the points of peak performance, and the most appropriate methods for scaling.
The test environment for the solution has a single MDS pair and a single OSS pair with a total of 960TB of raw disk space. The OSS pair contains two PowerEdge R730s, each with 256GB of memory, four 12Gbps SAS controllers and a single InfiniBand EDR adapter. Consult the Dell HPC Lustre Storage Configuration Guide for details of cabling and expansion card locations.
echo 3 > /proc/sys/vm/drop_caches
In addition, to simulate a cold cache on the server, a “sync” was performed on all the active servers (OSS and MDS) before each test and the kernel was instructed to drop caches with the same commands used on the client. In measuring the performance of the Dell HPC Lustre Storage solution, all tests were performed with similar initial conditions.
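For reference, the same cold-cache sequence can be replayed across all nodes before a run; the pdsh utility and the host names below are assumptions used purely for illustration:

pdsh -w oss[1-2],mds[1-2],client[1-32] "sync; echo 3 > /proc/sys/vm/drop_caches"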
respectively. The write and read performance rises sharply as we increase the number of process threads up to 24, and then levels out as we approach 256 threads. This is partially a result of increasing the number of OSTs utilized as the number of threads increases (up to the 24 OSTs in our system). To maintain the higher throughput for an even greater number of files, increasing the number of OSTs is likely to help.
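As an illustrative sketch of such a run (the record size, file size, thread count, and client list file below are assumptions, not the exact parameters used in this study), an N-to-N sequential IOzone test can be launched in clustered mode:

iozone -i 0 -i 1 -c -e -w -r 1024k -s 8g -t 24 -+n -+m ./clientlist
# -i 0 and -i 1 select sequential write and read, -t sets the number of concurrent threads,
# and -+m points at a file listing the client hosts that participate in the test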
Figure 11: N-to-N Random reads and writes IOPS (chart: IOzone Random - Dell HPC Lustre Storage with InfiniBand EDR; x-axis: number of concurrent threads from 4 to 256, y-axis: IOPS; series: read, write)
4.3 Metadata Testing
Metadata testing measures the time to complete certain file or directory operations that return attributes.
Also during the preliminary metadata testing, we concluded that the number of files per directory significantly affects the results, even when the total number of files created is kept constant.
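As a sketch only (the MPI launcher, process count, file counts, and mount point below are assumptions), a file-only MDtest run that controls how many files land in each directory might look like:

mpirun -np 64 ./mdtest -F -u -d /mnt/lustre/mdtest -n 3125 -i 3
# -n sets the number of items each task creates; combined with -u (unique working directory
# per task), it determines the number of files per directory for a given total file count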
Figure 12: File Metadata Operations (chart: MDtest Files Metadata - Dell HPC Lustre Storage with InfiniBand EDR; x-axis: threads from 12 to 240, y-axis: OPS; series: file create, file stat, file remove)
Figure 12 illustrates the file metadata results using MDtest. As shown in this graph, file create metadata operations start with a little more than 13.
The max_pages_per_rpc parameter is a tunable that sets the maximum number of pages that will undergo I/O in a single RPC to that OST.
[root@node057 ~]# lctl set_param osc.*.max_rpcs_in_flight=64
The max_rpcs_in_flight parameter is a tunable that sets the maximum number of concurrent RPCs in flight to the OST. This parameter, in the majority of cases, will help with small file I/O patterns.
[root@node057 ~]# lctl set_param llite.*.
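For reference, the current values of these client-side tunables can be inspected before changing them; the example assumes only that the defaults are being reviewed on the same client node:

[root@node057 ~]# lctl get_param osc.*.max_pages_per_rpc
[root@node057 ~]# lctl get_param osc.*.max_rpcs_in_flight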
The continued use of generally available, industry-standard benchmark tools like IOzone and MDtest provides an easy way to match current and expected growth with the performance outlined. The profiles reported by each of these tools provide sufficient information to align the configuration of the Dell HPC Lustre Storage Solution with the requirements of many applications or groups of applications.
-I    number of items per directory in tree
-y    sync file after writing
-u    unique working directory for each task
-t    time unique working directory overhead
-F    perform test on files only (no directories)
-D    perform test on directories only (no files)