Administrator Guide
4 Deep Learning Performance Scale-Out
Motivation
With recent advances in Machine Learning, and Deep Learning in particular, it is increasingly important to identify the right set of tools to meet the performance requirements of these workloads. Because Deep Learning is compute intensive, accelerators such as GPUs have become the norm; however, GPUs are premium components, and the decision often comes down to the performance difference between a system with GPUs and one without. To that end, Dell EMC continually supports the business goals of customers by building highly scalable and reliable infrastructure for Machine Learning/Deep Learning workloads and by exploring new solutions for large-scale distributed training, in order to optimize return on investment (ROI) and total cost of ownership (TCO).
Test Methodology
We classified the TensorFlow (TF) benchmark tests into two categories: short tests and long tests. While developing the short tests, we experimented with several configurations to determine the one that yielded the highest throughput in images/second; we then used that configuration to run the long tests until the target accuracy was reached.
Short Tests
Each test consisted of 10 warmup steps followed by another 100 steps, which were averaged to obtain the actual throughput. The benchmarks were first run with 1 NVIDIA GPU to establish a baseline in images/sec, and then with 4 and 8 GPUs. These tests allowed us to experiment with parameter tuning of the models in distributed mode.
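The measurement procedure above can be sketched as follows. This is a minimal illustration of how warmup steps are discarded before averaging; the per-step times, batch size, and GPU count are placeholder assumptions, not measured values from this guide.

```python
# Sketch of the short-test measurement: run warmup steps, discard them,
# then average the remaining step times to report images/sec.
WARMUP_STEPS = 10
MEASURED_STEPS = 100
BATCH_SIZE = 256   # per-GPU batch size (assumed for illustration)
NUM_GPUS = 4       # GPUs per C4140-M node

# Hypothetical per-step wall-clock times in seconds.
# Warmup steps are typically slower (graph building, cache warm-up).
step_times = [0.50] * WARMUP_STEPS + [0.40] * MEASURED_STEPS

# Only the steps after warmup count toward the reported throughput.
measured = step_times[WARMUP_STEPS:]
avg_step_time = sum(measured) / len(measured)
images_per_sec = BATCH_SIZE * NUM_GPUS / avg_step_time
print(f"{images_per_sec:.0f} images/sec")
```

Discarding the warmup steps matters because the first iterations include one-time costs (graph compilation, memory allocation) that would otherwise skew the average downward.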
Long Tests
The long tests were run for 90 epochs, the standard training length for ResNet-50. This criterion was used to determine the total training time on C4140-M servers in distributed mode, using the best parameter tuning found in the short tests and the maximum number of GPUs supported by the system. The section below describes the setup used, and Table 1 gives an overall view of the test configuration.
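The arithmetic behind a 90-epoch run can be sketched as below. It assumes the ImageNet-1K training set (1,281,167 images) and an illustrative global batch size; both are assumptions for this example, not configuration values stated in this guide.

```python
# Rough training-length arithmetic for a 90-epoch ResNet-50 run.
TRAIN_IMAGES = 1_281_167   # ImageNet-1K training set (assumed dataset)
EPOCHS = 90
GLOBAL_BATCH = 256 * 4     # per-GPU batch x 4 GPUs (assumed)

# Steps needed to see the full dataset once, then the full run length.
steps_per_epoch = TRAIN_IMAGES // GLOBAL_BATCH
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)
```

Multiplying `total_steps` by the average step time measured in the short tests gives a first-order estimate of total training time, which is the quantity the long tests measure directly.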
Testing Setup
Table 1. Test configuration

Commercial application  ▪ Computer vision – image classification
Benchmarks code         ▪ TensorFlow Benchmarks scripts
Topology                ▪ Single node and multi-node over InfiniBand
Server                  ▪ PowerEdge C4140-M (4x V100-16GB-SXM2)
Frameworks              ▪ TensorFlow with Horovod library for distributed mode [1]
Models                  ▪ Convolutional neural networks: Inception-v4, Inception-v3,
                          VGG-19, VGG-16, ResNet-50, and GoogLeNet
Batch size              ▪ 128–256
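A multi-node run with this stack is typically launched with `mpirun`, one process per GPU, pointing at the TensorFlow Benchmarks script (`tf_cnn_benchmarks.py`). The sketch below is an illustrative invocation only: the hostnames, environment variables, and batch size are placeholders, not the exact command used for the results in this guide.

```shell
# Illustrative Horovod launch: 2 nodes x 4 GPUs = 8 processes.
# node1/node2 are placeholder hostnames.
mpirun -np 8 -H node1:4,node2:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python tf_cnn_benchmarks.py \
        --model=resnet50 \
        --batch_size=256 \
        --variable_update=horovod \
        --num_warmup_batches=10 \
        --num_batches=100
```

With `--variable_update=horovod`, each MPI process drives one GPU and gradients are averaged across processes over InfiniBand, so no `--num_gpus` flag is needed; the warmup/measured step counts mirror the short-test methodology described above.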