Administrator Guide
4 Deep Learning Performance Scale-Out
Motivation
With recent advances in Machine Learning, and Deep Learning in particular, it is increasingly important to identify the right set of tools to meet the performance requirements of these workloads. Because Deep Learning is compute intensive, accelerators such as GPUs have become the norm; however, GPUs are premium components, and the decision often comes down to the performance difference between a system with GPUs and one without. To that end, Dell EMC continually supports the business goals of customers by building highly scalable and reliable infrastructure for Machine Learning/Deep Learning workloads and by exploring new solutions for large-scale distributed training, in order to optimize return on investment (ROI) and total cost of ownership (TCO).
Test Methodology
We classified the TensorFlow (TF) benchmark tests into two categories: short tests and long tests. While developing the short tests, we experimented with several configurations to determine the one that yielded the highest throughput in images/second; we then used that configuration to run the long tests until the target accuracy was reached.
Short Tests
Each test consisted of 10 warmup steps followed by another 100 steps, which were averaged to obtain the actual throughput. The benchmarks were first run with 1 NVIDIA GPU to establish a baseline in images/sec, and then with 4 and 8 GPUs. These tests allowed us to experiment with parameter tuning of the models in distributed mode.
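The measurement procedure above can be sketched as follows. This is a minimal illustration of how warmup steps are discarded before averaging; the per-step times, batch size, and GPU count are placeholder assumptions, not measured values from this guide.

```python
# Sketch of the short-test measurement: run warmup steps, discard them,
# then average the remaining step times to report images/sec.
WARMUP_STEPS = 10
MEASURED_STEPS = 100
BATCH_SIZE = 256   # per-GPU batch size (assumed for illustration)
NUM_GPUS = 4       # GPUs per C4140-M node

# Hypothetical per-step wall-clock times in seconds.
# Warmup steps are typically slower (graph building, cache warm-up).
step_times = [0.50] * WARMUP_STEPS + [0.40] * MEASURED_STEPS

# Only the steps after warmup count toward the reported throughput.
measured = step_times[WARMUP_STEPS:]
avg_step_time = sum(measured) / len(measured)
images_per_sec = BATCH_SIZE * NUM_GPUS / avg_step_time
print(f"{images_per_sec:.0f} images/sec")
```

Discarding the warmup steps matters because the first iterations include one-time costs (graph compilation, memory allocation) that would otherwise skew the average downward.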
Long Tests
The long tests were run for 90 epochs, the standard training length for ResNet-50. This criterion was used to determine the total training time on C4140-M servers in distributed mode, using the best parameter tuning found in the short tests and the maximum number of GPUs supported by the system. The section below describes the setup used, and Table 1 gives an overall view of the test configuration.
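The arithmetic behind a 90-epoch run can be sketched as below. It assumes the ImageNet-1K training set (1,281,167 images) and an illustrative global batch size; both are assumptions for this example, not configuration values stated in this guide.

```python
# Rough training-length arithmetic for a 90-epoch ResNet-50 run.
TRAIN_IMAGES = 1_281_167   # ImageNet-1K training set (assumed dataset)
EPOCHS = 90
GLOBAL_BATCH = 256 * 4     # per-GPU batch x 4 GPUs (assumed)

# Steps needed to see the full dataset once, then the full run length.
steps_per_epoch = TRAIN_IMAGES // GLOBAL_BATCH
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)
```

Multiplying `total_steps` by the average step time measured in the short tests gives a first-order estimate of total training time, which is the quantity the long tests measure directly.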
Testing Setup
Table 1. Test configuration

Commercial application  ▪ Computer vision – image classification
Benchmarks code         ▪ TensorFlow Benchmarks scripts
Topology                ▪ Single node and multi-node over InfiniBand
Server                  ▪ PowerEdge C4140-M (4x V100-16GB-SXM2)
Frameworks              ▪ TensorFlow with Horovod library for distributed mode [1]
Models                  ▪ Convolutional neural networks: Inception-v4, Inception-v3,
                          VGG-19, VGG-16, ResNet-50, and GoogLeNet
Batch size              ▪ 128–256
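A multi-node run with this stack is typically launched with `mpirun`, one process per GPU, pointing at the TensorFlow Benchmarks script (`tf_cnn_benchmarks.py`). The sketch below is an illustrative invocation only: the hostnames, environment variables, and batch size are placeholders, not the exact command used for the results in this guide.

```shell
# Illustrative Horovod launch: 2 nodes x 4 GPUs = 8 processes.
# node1/node2 are placeholder hostnames.
mpirun -np 8 -H node1:4,node2:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    python tf_cnn_benchmarks.py \
        --model=resnet50 \
        --batch_size=256 \
        --variable_update=horovod \
        --num_warmup_batches=10 \
        --num_batches=100
```

With `--variable_update=horovod`, each MPI process drives one GPU and gradients are averaged across processes over InfiniBand, so no `--num_gpus` flag is needed; the warmup/measured step counts mirror the short-test methodology described above.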