Deep Learning using the iAbra stack on Dell EMC PowerEdge Servers with Intel technology

Abstract
This whitepaper evaluates the performance and efficiency of running Deep Learning training and inference using the iAbra framework on the Dell EMC PowerEdge C6420 server. The objective of this whitepaper is to demonstrate how iAbra on PowerEdge server infrastructure offers a new approach to solving both training and inference in a heterogeneous environment.
Revisions

Date        Description
June 19     Initial release

Acknowledgements
This paper was produced by the following members of the Dell EMC SIS team, iAbra and Intel:
Authors: Bhavesh Patel [Dell EMC], Greg Compton [iAbra], Greg Nash [Intel PSG]

The information in this publication is provided “as is.”
Table of contents
Revisions
Acknowledgements
Executive summary
1 Overview of Deep Learning
1.1 Deep Learning Inferencing
2 Why iAbra?
2.1 Introduction to PathWorks
2.2 iAbra AI Use Case Qualification
2.3 Application areas for PathWorks
3 Why the PowerEdge C6420 server?
3.1 Overview C6420
4 Why FPGA?
4.1 FPGAs in Mission-Critical Applications
4.2 Intel Programmable Acceleration Card
5 Reference Setup of PowerEdge C6420
6 High level data flow for performance testing
Executive summary
The enterprise needs solutions to business problems from deep learning, not more technical challenges. Many challenges exist in providing an integrated platform for deep learning development, especially in the IoT era. From semiconductor parts to interconnects and servers, through to software libraries and domain-centric data science, every layer in the system has important consequences for the business effectiveness of the overall solution.
1 Overview of Deep Learning
Deep learning consists of two phases: training and inference. As illustrated in Figure 1, training involves learning a neural network model from a given training dataset over a number of training iterations, guided by a loss function [1]. The output of this phase, the learned model, is then used in the inference phase to make predictions on new data.
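As a generic illustration only (this is not the iAbra training flow; PyTorch and synthetic data are used purely for familiarity), a minimal training loop over the two ingredients named above, a dataset and a loss function, might look like the following:

```python
# A minimal, generic sketch of the training phase in PyTorch (illustrative
# only; not the iAbra stack). Synthetic data stands in for a real dataset.
import torch
import torch.nn as nn

inputs = torch.randn(256, 20)                 # synthetic training samples
labels = torch.randint(0, 2, (256,))          # synthetic class labels

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()               # the loss function guiding learning
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for iteration in range(100):                  # a fixed number of training iterations
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)     # forward pass and loss
    loss.backward()                           # backward pass: compute gradients
    optimizer.step()                          # weight update
```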
1.1 Deep Learning Inferencing
After a model is trained, the generated model may be deployed (forward propagation only), e.g. on FPGAs, CPUs or GPUs, to perform a specific business-logic function or task such as identification, classification, recognition or segmentation [Figure 2].
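Continuing the same illustrative sketch, inference reuses the learned model for forward propagation only; FPGA or GPU deployment would go through a vendor toolchain rather than this CPU-side code:

```python
# Inference phase: forward propagation only (generic PyTorch sketch,
# reusing the `model` trained in the previous snippet).
model.eval()                            # switch off training-only behaviour
new_inputs = torch.randn(4, 20)         # a batch of unseen data (synthetic)
with torch.no_grad():                   # no gradients needed at inference time
    scores = model(new_inputs)
    predictions = scores.argmax(dim=1)  # e.g. predicted class per sample
```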
2 Why iAbra?
2.1 Introduction to PathWorks
[Figure: PathWorks flow diagram]
The use of FPGAs for AI inference, in both the data center and embedded/IoT applications, can deliver greater efficiency in terms of silicon size, weight and power (SWaP). However, exploiting the inherent SWaP benefits that the FPGA can deliver requires that the machine-learnt models to be inferred “fit” into the resources available on the target FPGA.
2.2 iAbra AI Use Case Qualification
Embedded Use Case Characteristics
○ Low Size, Weight and Power
○ Type Approval or Plan of Record Compliant
○ Low Failure Rates
○ Environmental Survival
○ Security

Network Creation Use Case Characteristics
○ Embedded End Use Case
○ Abstracted Training Platform
○ Non-Data Scientist Users
○ Reduced Time to Solution
○ Smaller, More Targeted Networks (e.g. sub-1000 neurons)
○ Training to Inference Fidelity
• The NN is then optimised for power and performance using the training data; this can be done in the datacentre on the Intel Arria 10 PAC card or on an embedded system (a generic sketch of this kind of optimisation step follows this list).
• As the network is compact, training times are relatively short, and the end result is a highly optimised NN that can be tested and run in the datacentre or on an embedded system.
• Short development time is a key benefit, but so is the fact that the end result is easier to implement and more efficient in an embedded system than when using other development approaches.
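iAbra does not publish the details of its optimisation step. Purely as an illustration of the general idea of trading model precision for power and performance, PyTorch's post-training dynamic quantization is one widely used technique:

```python
# Illustrative only: iAbra's power/performance optimisation is proprietary.
# Dynamic quantization stores Linear-layer weights as int8, shrinking the
# model and speeding up CPU inference at a small accuracy cost.
import torch

quantized = torch.quantization.quantize_dynamic(
    model,                 # a trained floating-point network
    {torch.nn.Linear},     # layer types to quantize
    dtype=torch.qint8)     # 8-bit integer weights
```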
[Figure: iAbra neuron implementation]
2.3 Application areas for PathWorks
PathWorks provides the ability to structure unstructured, high-dimensional data in a low-SWaP (size, weight and power) footprint for custom problem domains. Generically, this makes it ideal for solutions with the following demands:
3 Why the PowerEdge C6420 server?
3.1 Overview C6420
The PowerEdge C6420 optimizes compute, memory and large-volume local storage for an IT services platform that can be scaled efficiently and predictably while drastically reducing complexity. With up to four independent, hot-swappable two-socket servers in a very dense 2U chassis, servers can be easily repurposed as workloads change.
4 Why FPGA?
FPGAs (Field-Programmable Gate Arrays) offer a blank slate on which to build the solution that best fits the problem, instead of fitting a solution into a predefined architecture. FPGAs provide flexibility for AI system architects searching for competitive deep learning accelerators that also support differentiating customization.
4.1 FPGAs in Mission-Critical Applications
Mission-critical applications (e.g. autonomous vehicles, defense and intelligence, manufacturing, smart agriculture, smart cities) require deterministic low latency. The data flow in such applications is often a stream, requiring pipeline-oriented processing. FPGAs are excellent for these kinds of use cases given their support for fine-grained, bit-level operations, in contrast to CPUs and GPUs.
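As a loose software analogy of pipeline-oriented stream processing (real FPGA pipelines are written in RTL or HLS, not Python), chained generators process each item as it arrives, so results stream out with per-item rather than per-batch latency:

```python
# Software analogy of a processing pipeline: each stage handles one item
# at a time, like pipeline stages on an FPGA (illustrative only).
def source():
    for frame in range(5):        # stands in for a sensor or camera stream
        yield frame

def preprocess(stream):
    for x in stream:
        yield x * 2               # stage 1: e.g. decode/normalise

def score(stream):
    for x in stream:
        yield x + 1               # stage 2: e.g. model inference

for result in score(preprocess(source())):
    print(result)                 # stage 3: act on each result as it emerges
```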
4.2 Intel Programmable Acceleration Card
The Intel Programmable Acceleration Card (PAC) features an Intel Arria® 10 FPGA, an industry-leading programmable logic device built on 20 nm process technology that integrates a rich feature set of embedded peripherals, embedded high-speed transceivers, hard memory controllers and IP protocol controllers.
5 Reference Setup of PowerEdge C6420

Features      C6420 (Sled 1-4)
Chassis       PowerEdge C6420
PSU           2 x 1600 W PSU
CPU           Skylake 6148, 20C, 2.4 GHz
Memory        256 GB memory (16 GB DIMMs)
Networking    OPA 100 Gbps
Accelerator   Intel PAC Arria 10 FPGA adapter
OS            RHEL 7.4
Storage       2x1.
TOR
6 High level data flow for performance testing
6.1.1 CPU & FPGA utilization (Average during training time) The iAbra application has been tuned to balance the load across CPU and FPGA resulting in utilizations of 82% on the CPU(40 cores) and 94% on the FPGA. 6.1.2 Top 1% accuracy & Average individual run time The iAbra application developed a network and weights resulting in 96.2% accuracy in 42mins. The figure above shows the network convergence over time.