Whitepaper: Deep Learning Performance Scale-Out
Revision: 1.2
Issue Date: 3/16/2020

Abstract
In this whitepaper we evaluated the training performance of a scale-out implementation with the latest software stack and compared it with the results obtained in our previous paper [0]. Using TensorFlow as the primary framework and the TensorFlow benchmark models, performance was compared in terms of throughput (images/sec) on the ImageNet dataset, at both the single-node and multi-node level.
Revisions
Date        Description
3/16/2020   Initial release

Acknowledgements
This paper was produced by the following:
Vilmara Sanchez   Dell EMC - Software Engineer
Bhavesh Patel     Dell EMC - Distinguished Engineer
Josh Anderson     Dell EMC - System Engineer (contributor)

We would like to acknowledge:
❖ Technical Support Team - Mellanox Technologies
❖ Uber Horovod GitHub Team
❖ Nvidia Support Team
Table of Contents
Motivation
Test Methodology
PowerEdge C4140-M Details
Performance Results - Short Tests for Parameter Tuning
Performance Results - Long Tests Accuracy Convergence
Conclusion
Server Features
Citation
References
Appendix: Reproducibility
Motivation
With the recent advances in the field of Machine Learning, and especially Deep Learning, it is becoming increasingly important to identify the right set of tools to meet the performance requirements of these workloads. Since Deep Learning is compute intensive, accelerators such as GPUs have become the norm; however, GPUs are premium components, and the decision often comes down to the performance difference between a system with and without GPUs.
Test configuration summary:
GPUs                 1-8
Performance metrics  Throughput (images/second); training to convergence at ~76% accuracy
Dataset              ImageNet
Environment          Single-node and multi-node
Figure 1: Servers Logical Design. Source: Image adapted from https://community.mellanox.com/docs/DOC-2971

Figure 2 below shows how the PowerEdge C4140-M is connected via InfiniBand fabric for multi-node testing.

Figure 2: Using Mellanox CX5 InfiniBand adapter to connect PowerEdge C4140 in multi-node configuration

PowerEdge C4140-M Details
The Dell EMC PowerEdge C4140, an accelerator-optimized, high-density 1U rack server, is used as the compute node unit in this solution.
Designs that connect Volta SXM modules through PCIe bridges limit the total available bandwidth between the CPU and the GPUs. See Table 2 for the Host-GPU Complex PCIe Bandwidth Summary.
Performance Results - Short Tests for Parameter Tuning
Below are the results for the short tests using TF 1.14. In this section we tested all the models in multi-node mode and compared the results with those obtained with TF 1.10 in 2019.

Throughput of CNN Models: TF 1.10 vs TF 1.14
Figure 5 shows several CNN models, comparing results with TF 1.10 vs TF 1.14. In Figure 6 we see that the performance gain between the two releases is about 1.08X (or 8%).

Figure 5: Multi Node PowerEdge C4140-M - Several CNN Models TF 1.10 vs TF 1.14
Figure 6: Multi Node PowerEdge C4140-M - Several CNN Models TF 1.10 vs TF 1.14 (Speedup factor)

Performance Gain with XLA
Since there was not much performance gain with the basic configuration, we decided to explore the limits of GPU performance using other parameters. We looked at XLA (Accelerated Linear Algebra) [3], enabled by adding the flag --xla=true at the script level.
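For reference, the snippet below is a minimal sketch of how XLA JIT compilation can be enabled in a standalone TF 1.x script; the tf_cnn_benchmarks suite exposes the same behavior through its --xla flag, so this is an illustration of the mechanism rather than the benchmark code itself.

```python
import tensorflow as tf

# Turn on XLA JIT compilation globally for a TF 1.x session.
# XLA fuses eligible ops into compiled clusters, which is where
# the throughput gains reported in this section come from.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    # Build and run the training graph here; any op clusters that
    # XLA can compile will be lowered automatically.
    pass
```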
Figure 7: Multi Node PowerEdge C4140-M. Several CNN Models TF 1.10 vs TF 1.14 + XLA

Figure 8: Multi Node PowerEdge C4140-M. Several CNN Models TF 1.10 vs TF 1.14 + XLA (Speedup factor)
ResNet-50's Performance with TF 1.14 + XLA
In this section, we evaluated the performance of the ResNet-50 model trained with TF 1.14 and with TF 1.14 with XLA enabled. The tests were run with 1 GPU, 4 GPUs, and 8 GPUs, and the results were compared with those obtained for version TF 1.10 from our previous paper [0]. We also explored the performance using batch sizes of 128 and 256. See Figure 9 and Figure 10.

Figure 9: Multi Node PowerEdge C4140-M. ResNet-50 BS 128 TF 1.10 vs TF 1.14 vs TF 1.14 + XLA
Figure 10: Multi Node PowerEdge C4140-M. ResNet-50 BS 256 TF 1.10 vs TF 1.14 vs TF 1.14 + XLA

ResNet-50 with TF 1.14 + XLA + GPUDirect RDMA
Another feature explored in our previous paper was GPUDirect RDMA, which provides a direct P2P (peer-to-peer) data path between GPU memory across nodes using a Mellanox HCA device. In this test, we enabled it by adding the NCCL flag -x NCCL_NET_GDR_LEVEL=3 at the script level (this variable replaced NCCL_IB_CUDA_SUPPORT in NCCL v2.4.0).
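When launching with mpirun, the -x option exports the variable to every rank; the same effect can be sketched directly in Python, assuming the variable is set before NCCL initializes (i.e., before the first collective operation runs):

```python
import os

# NCCL reads its environment variables at initialization time, so this
# must run before the first allreduce. NCCL_NET_GDR_LEVEL replaced
# NCCL_IB_CUDA_SUPPORT as of NCCL v2.4.0; level 3 (PHB) allows
# GPUDirect RDMA when the GPU and the NIC sit under the same host bridge.
os.environ["NCCL_NET_GDR_LEVEL"] = "3"
```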
Figure 11: Multi Node PowerEdge C4140-M. ResNet-50 with TF 1.14 + XLA + GPUDirect RDMA

Figure 11 shows the results of ResNet-50 with TF 1.14 with XLA enabled, with and without GPUDirect RDMA. We did not observe much performance gain using GPUDirect RDMA across nodes, i.e., the performance remained the same, and hence we did not explore it further in our testing.
Figure 12: Multi Node PowerEdge C4140-M - ResNet-50's Configuration for Best Performance

ResNet-50's Scale-Out
The PowerEdge C4140, using the Nvidia 4x NVLink architecture, scales relatively well when using the Uber Horovod distributed training library and Mellanox InfiniBand as the high-speed link between nodes. It scales ~3.9x within a node and ~6.9x with scale-out for ResNet-50 with batch size 256. See Figure 13, and the training sketch below for context on the Horovod pattern.
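For context, this is a minimal sketch of the Horovod data-parallel pattern used for distributed training in TF 1.x: one process per GPU, an optimizer wrapped for ring-allreduce gradient averaging, and a rank-0 broadcast of the initial weights. A toy loss stands in for the actual CNN graph; this is not the benchmark code itself.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each Horovod process to a single GPU (one process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy variable and loss standing in for the CNN model.
w = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(w - 1.0))

# Scale the base learning rate by the worker count and wrap the
# optimizer so gradients are averaged via ring-allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
train_op = hvd.DistributedOptimizer(opt).minimize(loss)

# Rank 0 broadcasts initial variables so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)
```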
Figure 13: Multi Node PowerEdge C4140-M - ResNet-50's Scale-out vs Scale-up

Figure 14: Multi Node PowerEdge C4140-M vs Competitor
The benchmarks shown in Figure 14 were run on two C4140 servers with 4x V100 GPUs each, connected by a Mellanox ConnectX-5 network adapter at 100 Gbit/s over IPoIB. The Dell EMC distributed mode with Horovod achieved 85% scaling efficiency for ResNet-50 with batch size 256 compared with the ideal performance; on the other hand, it achieved 95% scaling efficiency versus a test run by the TensorFlow team in 2018 on a VM (virtual machine) instance on GCP (Google Cloud) with 8x V100 GPUs and batch size 364 [5].
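Scaling efficiency here means the measured multi-node throughput divided by ideal linear scaling of the single-node throughput. A small illustration of the arithmetic, using hypothetical throughput numbers rather than our measured data:

```python
def scaling_efficiency(multi_node_ips, single_node_ips, num_nodes):
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return multi_node_ips / (num_nodes * single_node_ips)

# Hypothetical figures for illustration only: two nodes delivering
# 8,500 images/sec against a 5,000 images/sec single node.
print(scaling_efficiency(8500.0, 5000.0, 2))  # -> 0.85, i.e. 85%
```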
Conclusion
• The performance with TF 1.14 among the models was only slightly better (~1%-8%) than with TF 1.10. On the other hand, TF 1.14 with XLA boosted performance by up to ~46% for the models ResNet-50, Inception-v3 and Inception-v4.
• In the case of the ResNet-50 model, performance improved by up to ~3% with TF 1.14, and by up to ~46% with TF 1.14 and XLA enabled. ResNet-50 with batch size 256 scaled better (1.46X) than ResNet-50 with BS 128 (1.35X).
Server Features
Citation
@article{sergeev2018horovod,
  author  = {Alexander Sergeev and Mike Del Balso},
  journal = {arXiv preprint arXiv:1802.05799},
  title   = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  year    = {2018}
}

References
• [0] https://downloads.dell.com/manuals/allproducts/esuprt_solutions_int/esuprt_solutions_int_solutions_resources/serverssolution-resources_white-papers52_en-us.pdf
• [1] Horovod GitHub, "Horovod Distributed Deep Learning Training Framework" [Online].
Appendix: Reproducibility
The section below walks through the setup requirements for the distributed Dell EMC system and the execution of the benchmarks.
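As a sketch of the kind of launch command used for the multi-node runs, the following Python wrapper builds an mpirun invocation of tf_cnn_benchmarks. The hostnames, process counts, and script path are placeholders to be adapted to the actual cluster; the flags shown (--model, --batch_size, --variable_update=horovod, --xla) are the tf_cnn_benchmarks options discussed in this paper.

```python
import subprocess

# Hypothetical two-node layout: 4 GPUs per C4140 node, 8 ranks total.
cmd = [
    "mpirun", "-np", "8",
    "-H", "node1:4,node2:4",           # placeholder hostnames
    "-x", "NCCL_NET_GDR_LEVEL=3",      # optional: GPUDirect RDMA
    "python", "tf_cnn_benchmarks.py",  # path within the benchmarks repo
    "--model=resnet50",
    "--batch_size=256",
    "--variable_update=horovod",
    "--xla=true",
]
subprocess.run(cmd, check=True)
```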