Deep Learning Performance: Comparing Scale-out vs Scale-up

Abstract

This whitepaper examines the performance and efficiency of deep learning training on the Dell EMC PowerEdge C4140 server running neural network models. The objective is to show how the C4140 in a scale-out configuration performs against a scale-up server.
Deep Learning Performance: Scale-up vs Scale-out

Revisions

Date: February 2019 - Initial release

Acknowledgements

This paper was produced by the following persons:
Authors: Bhavesh Patel, Dell EMC Server Advanced Engineering; Vilmara Sanchez, Dell EMC Server Advanced Engineering
Contributor: Josh Anderson, Dell EMC System Engineering

The information in this publication is provided "as is." Dell Inc.
Acknowledgements

We would like to acknowledge the following individuals: Jaime Edwards (Director of PowerEdge Advanced Engineering) for setting the direction of this project, and April Berman (PowerEdge Acceleration Product Manager), Shreya Shah (PowerEdge C4140 Product Manager), and Trevor Montgomery (Enterprise Systems Group Business Development) for providing us the resources for this paper.
1 Overview

The objective of this whitepaper is to compare Dell EMC's acceleration-optimized PowerEdge servers and evaluate their performance when running deep learning workloads. The purpose is to highlight how Dell EMC's scale-out solution is ideally suited for these emerging workloads.
2 Introduction

Figure 1: Artificial Intelligence, Machine Learning and Deep Learning [Source: MIT]

Artificial Intelligence: First coined in 1956 by John McCarthy, AI involves machines that can perform tasks characteristic of human intelligence. While this is rather general, it includes things like planning, understanding language, recognizing objects and sounds, learning, and problem solving.
electrical charge – reaches a specific value. When a neuron fires, it generates a signal which travels to other neurons which, in turn, increase or decrease their potentials in accordance with this signal.

2.1 Deep Learning

Deep learning consists of two phases: training and inference. As illustrated in Figure 2, training involves learning a neural network model from a given training dataset over a certain number of training iterations, guided by a loss function.
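The training phase described above can be sketched in a few lines. This is a minimal illustration using a hypothetical linear model on synthetic data, not the benchmark code used in this paper; the model, data, and hyper-parameters are ours for illustration only.

```python
import numpy as np

# Minimal sketch of the training phase: repeated training iterations that
# adjust model weights to reduce a loss function on a training dataset.
# A tiny linear model stands in for a neural network.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))              # training dataset (features)
true_w = np.array([1.0, -2.0, 0.5, 3.0])   # "ground truth" the model must learn
y = X @ true_w                             # training targets

w = np.zeros(4)                            # model weights, zero-initialized
lr = 0.1                                   # learning rate
losses = []
for step in range(100):                    # training iterations
    pred = X @ w                           # forward pass
    loss = np.mean((pred - y) ** 2)        # loss function (mean squared error)
    grad = 2 * X.T @ (pred - y) / len(X)   # backward pass: gradient of the loss
    w -= lr * grad                         # weight update
    losses.append(loss)
```

Inference is then just the forward pass (`X @ w`) with the learned weights held fixed.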
The plot in Figure 3 shows GPU performance for single precision, and Figure 4 shows GPU performance for half precision. Most deep learning frameworks and models take advantage of half precision, since it lets them work with larger datasets within the available memory. It is very important to look at the raw FLOPS numbers for a GPU, since we want to extract the same level of performance when that GPU is put into a system.
Figure 4: GPU performance - Half precision [7]

To see whether Dell EMC PowerEdge servers can achieve the raw FLOPS numbers indicated in the charts above, we approached the problem by breaking it into different sections. The picture below illustrates how we approached this testing.
1. System bandwidth performance, i.e. PCIe connectivity to the GPUs - p2p bandwidth and latency tests
2. GPU hardware performance without any deep learning framework - Baidu DeepBench
3. System running GPUs and benchmarks - TensorFlow benchmarks

3.1 Criteria

1. To bound our testing, we picked TensorFlow as the framework of choice, since it has better support and models are readily available.
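As a back-of-the-envelope companion to step 1: effective bandwidth is simply bytes moved divided by elapsed time. The helper below is our own sketch of that arithmetic (the NVIDIA p2pBandwidthLatencyTest sample reports this for you); the transfer size and timing are illustrative, not measurements from this paper.

```python
def effective_bandwidth_gbps(bytes_moved: int, seconds: float) -> float:
    """Effective transfer bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9

# A PCIe Gen3 x16 link tops out near ~15.75 GB/s of usable bandwidth, so a
# hypothetical copy of 1 GiB in 80 ms (about 13.4 GB/s) would be in the
# expected range, while a much lower figure would suggest a topology issue.
bw = effective_bandwidth_gbps(1 << 30, 0.080)
```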
Figure 6: Frameworks Comparison

4 Test Methodology

The test methodology consists of three distinct phases. In Phase 1 we test the hardware performance of each server using the NVIDIA-supplied p2p bandwidth and latency tests and Baidu DeepBench; this is explained in the Phase 1 section. In Phase 2 we used the TensorFlow framework and ran some well-known neural models to compare performance in terms of throughput and training time.
Figure 7: Testing Methodology Workflow

1) Phase 1 - In this phase we perform basic tests, such as PCIe bandwidth and latency tests, to ensure the results align with what we expect based on theoretical numbers. We then ran the Baidu DeepBench benchmarks to evaluate deep learning performance for the accelerators and the system. The results for this step are presented in a separate whitepaper.
4.1.2 Long Test

The long tests were run to get throughput and the training time needed to reach a certain accuracy convergence. We used 90 epochs for the training run. These tests were run using the maximum number of GPUs supported by each server. In the section below, we describe the setup used, and Table 1 gives an overall view of the test configuration.
Performance Metrics:
- Throughput (images/second)
- Top-5 accuracy on the training dataset
- Training time

Training Tests:
- Short tests to get throughput (images/second)
- Long tests to get accuracy convergence and training time

Dataset: ILSVRC2012

Table 1: Benchmark Setup
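The throughput metric above is derived from per-step timing: each training step consumes one batch of images on every GPU. A sketch of the arithmetic, with made-up numbers (not measurements from this paper):

```python
def throughput_images_per_sec(batch_size_per_gpu: int, num_gpus: int,
                              step_time_s: float) -> float:
    """Images/second: each step consumes batch_size images on every GPU."""
    return batch_size_per_gpu * num_gpus / step_time_s

# e.g. batch size 128 per GPU on 4 GPUs with a hypothetical 0.4 s step time:
t = throughput_images_per_sec(128, 4, 0.4)   # 1280 images/s
```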
5 PowerEdge Server Details

5.1 PowerEdge C4140

The Dell EMC PowerEdge C4140, an accelerator-optimized, high-density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, both the V100-SXM2 and the V100-PCIe models.
Figure 8: C4140 Configuration-K. The two CPUs are connected by UPI; the four V100-SXM2 GPUs sit behind a single PCIe switch and are interconnected by NVLink.
Figure 9: C4140 Configuration-M. Each V100-SXM2 GPU attaches directly to a CPU over a dedicated x16 PCIe link (two GPUs per CPU), with NVLink between the GPUs.
Figure 10: PowerEdge C4140 Configuration B. The four V100-PCIe GPUs sit behind a single PCIe switch.

5.2 PowerEdge R740/R740xd

The PowerEdge R740/R740xd is a general-purpose platform with highly expandable memory (up to 3TB) and impressive I/O capability to match both read-intensive and write-intensive operations.
Figure 11: Dell PowerEdge R740/R740xd block diagram: three 300W double-wide GPUs spread across the two CPUs' x16 ports, plus PERC, Ethernet, and SAS SSD storage.

6 Framework Setup Details

6.1 Distributed Horovod-TensorFlow Setup

Horovod [8] [9] [10] is a distributed training framework for TensorFlow, Keras and PyTorch, initially developed by Uber.
The tests were run in a Docker environment; Figure 12 shows the logical layers of the software stack configuration. Each server is connected to the InfiniBand switch and has the Mellanox OFED for Ubuntu, Docker CE, and the GPUDirect RDMA API installed on the host, along with a container image built with Horovod and Mellanox OFED, among other supporting libraries.
Figure 14 shows how, with GPUDirect RDMA, GPU memory is accessed directly instead of copying the data multiple times across the system components; this feature is reflected directly in the throughput performance of the server.

Figure 14: Nvidia GPUDirect RDMA Connection. Source: https://www.sc-asia.org

6.2 Evaluation Platform Setup

Table 4 shows the software stack configuration used to build the environment to run the tests.
7 Performance Results

7.1 Single Node - Throughput (images/sec)

The charts below show the results for the different servers running the short tests to extract throughput (images/second) using ResNet-50 with batch size 128 and 100 steps. The single-node results use the maximum number of GPUs supported within each node.
7.1.2 PowerEdge C4140-V100-16GB-PCIe [Config B] - Single Node

Figure 16: PowerEdge C4140-V100-16GB-PCIe in single-node
7.1.5 PowerEdge C4140-M-V100-16GB-SXM2 - Single Node

Figure 19: PowerEdge C4140-M-V100-16GB-SXM2 in single-node

7.1.7 PowerEdge C4140-V100-SXM2 Configuration K versus Configuration M - Single Node

The plot below compares the performance of Config-K and Config-M, although it is not a like-for-like comparison because of the CPU difference: Config-K has the Intel Xeon 4116 (2.1 GHz, 12 cores), versus Config-M with the Intel Xeon 6148 (2.4 GHz, 20 cores).
Figure 20: PowerEdge C4140-V100-SXM2 Configuration-K vs PowerEdge C4140-V100-SXM2 Configuration-M

As shown in Figure 21 below, the number of CPU cores does play a role in throughput, and the biggest difference appears when running AlexNet.

7.1.8 What role does the CPU play in deep learning?

The CPU plays a major role in the initial phase called data preprocessing.
Figure 21: Performance difference between Intel Xeon 4116 and Intel Xeon 6148 in the C4140-M

As Figure 21 shows, there is only a slight performance difference for most of the neural models when using the Intel Xeon 6148, except for AlexNet, which shows almost double the performance.
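One way to see why the faster CPU helps AlexNet most is a simple two-stage bottleneck model: the CPU preprocesses (decodes and augments) images while the GPUs train on them, and the end-to-end rate is capped by the slower stage. The rates below are illustrative, not measurements from this paper.

```python
def pipeline_throughput(cpu_preprocess_rate: float,
                        gpu_compute_rate: float) -> float:
    """End-to-end images/s of a CPU-feeds-GPU pipeline: bound by the slower stage."""
    return min(cpu_preprocess_rate, gpu_compute_rate)

# A compute-light model like AlexNet runs so fast on the GPUs that the CPU's
# preprocessing rate becomes the ceiling, so extra cores raise throughput:
alexnet_like = pipeline_throughput(6000.0, 16000.0)   # CPU-bound
# A heavier model like ResNet-50 keeps the GPU as the bottleneck, so a
# faster CPU barely moves the number:
resnet_like = pipeline_throughput(6000.0, 3000.0)     # GPU-bound
```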
7.1.9 Conclusion
7.1.9.2 Single Node: 4 GPU

Figure 23: PowerEdge C4140-B P100-16GB PCIe vs C4140-K V100-SXM2: 4 GPU

As shown in Figures 22 and 23 above, there is not much difference in throughput (images/sec) when comparing PowerEdge C4140 Configuration B with the V100-16GB PCIe GPU and Configuration K with the V100-16GB SXM2 GPU. The reason is that in both configurations the GPUs operate in peer-to-peer mode behind a PCIe switch.
7.2 Throughput (images/s) - Multi Node

7.2.1 PowerEdge C4130-P100-16GB-PCIe - Multi Node

PowerEdge C4130 servers, each with four P100-PCIe GPUs, were configured in multi-node mode using InfiniBand RDMA to run TensorFlow in distributed mode.

Figure 25: Training with PowerEdge C4130-P100-16GB-PCIe in multi-node

The PowerEdge C4130 scales very well, with 97% efficiency within a node and 92% across nodes.
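The 97% and 92% figures are scaling efficiencies: measured throughput divided by perfect linear scaling from a single-unit baseline. A small helper makes the definition concrete; the throughput numbers here are illustrative, not from the charts.

```python
def scaling_efficiency(observed_throughput: float, n_units: int,
                       baseline_throughput: float) -> float:
    """Fraction of ideal linear scaling achieved across n GPUs or nodes."""
    return observed_throughput / (n_units * baseline_throughput)

# e.g. if one GPU sustained 230 images/s and four GPUs in a node reached
# 892 images/s, the in-node efficiency would be 892 / (4 * 230), about 0.97.
eff = scaling_efficiency(892.0, 4, 230.0)
```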
Figure 26: Scaling efficiency of the C4130-P100-16GB-PCIe across multiple GPUs and multiple nodes
7.2.2 PowerEdge C4140-K-V100-16GB and V100-32GB SXM2 - Multi Node

Figure 27: Training with PowerEdge C4140-V100-16&32GB-SXM2 in multi-node

PowerEdge C4140-V100-16GB-SXM2 and PowerEdge C4140-V100-32GB-SXM2 servers, with four GPUs each, were configured in multi-node mode to run TensorFlow in distributed mode, extract the throughput performance, and determine scaling efficiency. The GPUs scale very well: 97% within a node and 90% across nodes.
Figure 28: Scaling efficiency of the PowerEdge C4140-V100-16GB-SXM2 and PowerEdge C4140-V100-32GB-SXM2 across multiple GPUs and multiple nodes
7.2.3 PowerEdge C4140-M-V100-16GB-SXM2 - Multi Node

Figure 29: Training with PowerEdge C4140-M-V100-16GB-SXM2 in multi-node
Figure 30: Scaling efficiency of the PowerEdge C4140-M-V100-16GB-SXM2 across multiple nodes

7.2.4 PowerEdge C4140-K Multi-Node Training vs Non-Dell EMC 8x V100-16GB-SXM2

The non-Dell EMC 8x V100-16GB-SXM2 system was tested on the Nimbix cloud. Figure 31 shows the throughput performance of the 8x SXM2 system and compares it against the PowerEdge C4140-K-V100 in distributed mode (8 GPUs).
Figure 31: Training with PowerEdge C4140-K-V100-16&32GB-SXM2 (8 GPUs) in multi-node versus non-Dell EMC SN_8x-V100-16GB-SXM2

Model | SN 8x V100-16GB-SXM2 | MN PowerEdge C4140-K-V100-SXM2 (16GB & 32GB), Intel Xeon 4116 | % Diff
Inception-v4 | 1606 | 1625 | -1.21%
VGG-19 | 2449 | 2406 | 1.78%
VGG-16 | 2762 | 2820 | -2.03%
Inception-v3 | 3077 | 2845 | 8.16%
ResNet-50 | 4852 | 4500 | 7.81%
GoogLeNet | 7894 | 8754 | -9.82%
AlexNet | 16977 | 12145 | 39.79%
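The % Diff column appears to be the gap of the single-node system relative to the multi-node C4140 result; under that assumption the rows reproduce as follows (a sketch of the arithmetic, using two rows from the table):

```python
def pct_diff(single_node: float, multi_node: float) -> float:
    """Percent difference of single-node vs multi-node throughput."""
    return (single_node - multi_node) / multi_node * 100.0

resnet_diff = pct_diff(4852, 4500)      # positive: single node is faster here
googlenet_diff = pct_diff(7894, 8754)   # negative: multi-node is faster here
```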
Using the distributed framework Horovod over InfiniBand with GPUDirect RDMA, Figure 32 below shows the scaling efficiency reached by the PowerEdge C4140:

Figure 32: Performance with distributed Horovod TensorFlow, connected by a Mellanox ConnectX-5 network adapter at 100 Gbit/s over IPoIB with GPUDirect RDMA
7.2.5 PowerEdge C4140-M Multi-Node Training vs Non-Dell EMC 8x V100-16GB-SXM2

Figure 33.
7.2.6 PowerEdge C4140 Multi-Node Training with Different CPU Models vs 8x V100-16GB-SXM2

In the results shown in Figures 31 and 33 we configured the multi-node system with PowerEdge C4140-V100-SXM2 servers in Configuration-K (Intel Xeon 4116 CPU) and Configuration-M (Intel Xeon 6148 CPU) respectively, versus single-node training on the non-Dell EMC 8x V100-16GB-SXM2.
7.3 Results Showing Training Time vs Accuracy

These tests were run with 90 epochs to determine the training time needed to achieve top-1 and top-5 accuracy. Figure 35 shows the results for the 8x SXM2 system, the PowerEdge C4140-V100 in single-node and multi-node mode, the C4130-P100 in single-node and multi-node mode, and the R740-P40.

Results Highlights

The fastest training time was achieved by the 8x SXM2 system, with 93% accuracy convergence in 6.6 hours.
Figure 36: Relative speed performance based on training time

After training the PowerEdge C4140 Configuration M with SXM2 in multi-node configuration, we saw it reach the fastest training time of 5.3 hours, surpassing the non-Dell EMC SN_8x-V100-16GB-SXM2, which completed training in 6.6 hours. See Figure 36.
Figure 37: Training long tests to extract accuracy convergence and training time with the 8x SXM2 system and different models

In Figure 37 above we observe that ResNet-50 and Inception-v4 reached between 92% and 96% accuracy convergence in 90 epochs; however, ResNet-50 with different batch sizes converged faster than Inception-v4. On the other hand, VGG-19 did not reach acceptable accuracy, suggesting it requires more than 90 epochs to converge.
Figure 39: Training long tests to extract accuracy convergence and training time with PowerEdge C4140-K multi-node and single-node 8x V100-SXM2 with different models

Figure 39 above shows a comparison between the 8x SXM2 system and the PowerEdge C4140 Configuration-K in multi-node configuration using ResNet-50.
Figure 40: Multi-node training with PowerEdge C4140-V100-SXM2 Configuration-K (Intel Xeon 4116 CPU) and PowerEdge C4140-V100-SXM2 Configuration-M (Intel Xeon 6148 CPU), versus single-node training on the non-Dell 8x V100-16GB-SXM2

In Figure 40 we can see how the C4140-V100-SXM2 Configuration-M outperforms the other systems in training time across different batch sizes.
7.4.1 Hyper-parameter tuning

The commands below show the hyper-parameter tuning used to maximize throughput performance in single-node and distributed-mode server implementations. Figure 41 shows the high impact of hyper-parameter tuning on throughput performance.

Single Node - TensorFlow: #python3 tf_cnn_benchmarks.
Figure 41: Effect of hyper-parameter tuning on throughput performance

7.4.2 Learning Rate Effect in Distributed Mode

In this experiment, we used the following learning rate schedule: the initial learning rate was set to 0.4 for the first 10 epochs; after that it was decreased to 0.04 until the model reached 60 epochs of training; finally it was decreased to 0.004 until the end of training at 90 epochs.
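The schedule above is a stepwise decay. As a sketch, it can be written as a function of the epoch number (our own illustration, not the benchmark's flag syntax):

```python
def learning_rate(epoch: int) -> float:
    """Stepwise schedule from the experiment: 0.4, then 0.04, then 0.004."""
    if epoch < 10:        # first 10 epochs (0-9)
        return 0.4
    if epoch < 60:        # epochs 10-59
        return 0.04
    return 0.004          # epochs 60-89

# One value per epoch over the full 90-epoch run:
schedule = [learning_rate(e) for e in range(90)]
```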
Figure 42: Training with PowerEdge C4140-V100-16&32GB-SXM2 (4 GPUs) in single-node

Figure 43: Training with PowerEdge C4140-V100-16&32GB-SXM2 (8 GPUs) in multi-node
7.4.3 Communication and Neural Network Primitives

We wanted to explore the critical kernels executed on one GPU when running the TensorFlow benchmarks, so we used the NVIDIA profiling tool nvprof to analyze one TensorFlow benchmark trained with the C4130-P100-PCIe-16GB in multi-node mode. Figure 44 shows the critical kernels executed; we found that the all-reduce communication primitives were called 38.
8 Conclusion and Future Work

The PowerEdge C4140, using NVIDIA's 4x NVLink architecture, scales relatively well when using Uber's Horovod distributed training library and Mellanox InfiniBand RDMA as the high-speed link between nodes. Table 5 shows that the PowerEdge C4140 in multi-node configuration is, for the most widely used model, ResNet-50, within 7.8% of the single-node non-Dell EMC 8x NVLink system.
9 Citation

@article{sergeev2018horovod,
  Author = {Alexander Sergeev and Mike Del Balso},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  Year = {2018}
}

10 References

[1] Nvidia Blogs, "What's the Difference between Artificial Intelligence, Machine Learning, and Deep Learning?" [Online]. Available: https://blogs.Nvidia.