CheXNet – Inference with Nvidia T4 on Dell EMC PowerEdge R7425

Abstract
This whitepaper looks at how to implement inference using GPUs. The work is based on the CheXNet model developed by Stanford University to detect pneumonia. The paper describes how the trained model and TensorRT™ are used to perform inference on Nvidia T4 GPUs.
Revisions
Date: June 2019 — Description: Initial release

Acknowledgements
This paper was produced by the following members of the Dell EMC SIS team:
Authors: Bhavesh Patel, Vilmara Sanchez [Dell EMC Advanced Engineering]
Support: Josh Anderson [Dell EMC System Engineer]
Others: Nvidia account team for their expedited support; Nvidia Developer Forum; TensorFlow-TensorRT Integration Forum; Dell EMC HPC Engineering team {Lucas A.
Executive summary
The healthcare industry has been one of the leading industries to adopt machine learning and deep learning techniques to improve diagnosis, provide higher levels of detection accuracy, and reduce the overall costs related to misdiagnosis. Deep learning consists of two phases: training and inference. Training involves learning a neural network model from a given training dataset over a number of iterations while minimizing a loss function.
1 Background & Definitions
Deploying AI applications into production sometimes requires high throughput at the lowest possible latency. Models are generally trained in 32-bit floating point (fp32) precision but need to be deployed for inference at lower precision without losing significant accuracy. Using lower-bit precision such as 8-bit integer (int8) yields higher throughput because of lower memory and compute requirements.
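To make the precision trade-off concrete, the sketch below (plain NumPy, not TensorRT's actual calibration code) illustrates a symmetric int8 quantization of fp32 activations using a single scale factor; TensorRT derives comparable scale factors per tensor during calibration.

    import numpy as np

    # fp32 activations from some layer (synthetic example values)
    activations_fp32 = np.array([-2.7, -0.4, 0.0, 0.9, 3.1], dtype=np.float32)

    # Symmetric quantization: map the observed dynamic range onto [-127, 127]
    scale = np.abs(activations_fp32).max() / 127.0
    activations_int8 = np.clip(np.round(activations_fp32 / scale), -127, 127).astype(np.int8)

    # De-quantize to see the approximation error introduced by 8-bit storage
    recovered_fp32 = activations_int8.astype(np.float32) * scale
    print(activations_int8)   # int8 values, 4x smaller than fp32 per element
    print(recovered_fp32)     # close to the originals, small rounding error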
TensorFlow™ is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.
CheXNet is a deep-learning-based model for radiologist-level pneumonia detection on chest X-rays, developed by the Stanford University ML Group and trained on the ChestX-ray14 dataset. For the pneumonia detection task, the ML Group labeled the images showing pneumonia as positive examples and all images showing other pathologies as negative examples.
2 Test Methodology
In this project we ran image classification inference for the custom CheXNet model on the PowerEdge R7425 server in different precision modes and software configurations: TensorFlow with CPU support only, TensorFlow with GPU support, TensorFlow with TensorRT™ (TF-TRT), and native TensorRT™. Using these different settings, we were able to compare throughput and latency and assess the capability of the PowerEdge R7425 server when running inference with Nvidia TensorRT™.
Table 1 below summarizes the project design:

Table 1: Project Design Summary
Element | Description
Use Case | Optimized inference: image classification with TensorFlow and TensorRT™
Models | Custom model CheXNet and base model ResnetV2_50
Framework | TensorFlow 1.10
TensorRT™ version | TensorRT™ 5.0
TensorRT™ implementations | TF-TRT integration and native TensorRT™ (C++ API)
Precision modes | fp32, fp16, int8
Performance | Throughput (images/sec) and latency (ms)
Dataset | Chest X-ray images
Sample code | tensorrt_chexnet.py, tensorrt.py
Software stack configuration | Docker containers from Nvidia NGC
Server | Dell EMC PowerEdge R7425 with Nvidia Tesla T4 GPUs
2.2 Test Setup
a) For the hardware, we selected the PowerEdge R7425, which includes the Nvidia Tesla T4 GPU, the most advanced accelerator for AI inference workloads. According to Nvidia, the T4's new Turing Tensor Cores accelerate int8 precision more than 2x faster than the previous-generation low-power offering [2].
3 Development Methodology
In this section we describe how we trained the custom CheXNet model from scratch with the TensorFlow framework using transfer learning, and how the trained model was then optimized with TensorRT™ to run accelerated inference.
'Effusion', 'Hernia', 'Nodule', 'Pneumonia', 'Atelectasis', 'PT', 'Mass', 'Edema', 'Consolidation', 'Infiltration', 'Fibrosis', 'Pneumothorax']

Build a Convolutional Neural Network using Estimators: Here we describe how the CheXNet model was built with transfer learning using a custom Estimator. We used the high-level TensorFlow API tf.estimator.
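As an illustration of that multi-label scheme (a minimal sketch, not the paper's preprocessing code; the class order and the first two class names are assumptions), each image's findings can be encoded as a 14-dimensional multi-hot vector matching the class list above:

    import numpy as np

    # 14 pathology classes; order and the first two entries are assumptions for illustration
    chexnet_classes = ['Cardiomegaly', 'Emphysema', 'Effusion', 'Hernia', 'Nodule',
                       'Pneumonia', 'Atelectasis', 'PT', 'Mass', 'Edema',
                       'Consolidation', 'Infiltration', 'Fibrosis', 'Pneumothorax']

    def encode_findings(findings):
        """Turn the list of finding names for one image into a 14-dim multi-hot label."""
        label = np.zeros(len(chexnet_classes), dtype=np.float32)
        for finding in findings:
            label[chexnet_classes.index(finding)] = 1.0
        return label

    # An image showing both pneumonia and effusion
    print(encode_findings(['Pneumonia', 'Effusion']))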
Figure 6: Overview of the Estimator Interface [5]

See Table 4 for the Estimator methods and modes used to call train, evaluate, or predict. The Estimator framework invokes the model function with the mode parameter set as follows:

Table 4. Implement training, evaluation, and prediction. Source [4]

Create the Estimator:

chexnet_classifier = tf.estimator.Estimator(
    model_fn=model_function,
    model_dir=FLAGS.model_dir,
    config=run_config,
    params={
        'densenet_depth': FLAGS.densenet_depth,
        'data_format': FLAGS.data_format
    })
In this case, the architecture of an existing official network (resnet_v2_50) was used as the base model. The output of the model is defined by a layer with 14 neurons, one per class. Since an X-ray image can show more than one pathology, the model must support multi-label classification; to do so, we used the sigmoid activation function. See the code snippet below:

def model_fn(features, labels, mode, params):
    tf.summary.image('images', features, max_outputs=6)
    model = resnet_model.
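For illustration, a minimal sketch of the multi-label idea (this is not the paper's model_fn; the feature-extractor stand-in and variable names are assumptions): the final dense layer produces 14 logits, a sigmoid turns each into an independent probability, and the matching loss is sigmoid cross-entropy rather than softmax cross-entropy:

    import tensorflow as tf  # TensorFlow 1.x style, matching the rest of the paper

    def multi_label_head(features, labels):
        # Stand-in for the resnet_v2_50 backbone used in the paper
        backbone_features = tf.layers.flatten(features)
        logits = tf.layers.dense(backbone_features, units=14, name='chexnet_logits')

        # One independent probability per pathology; an X-ray may show several
        probabilities = tf.sigmoid(logits, name='chexnet_sigmoid_tensor')

        loss = None
        if labels is not None:
            # Multi-label loss: sigmoid cross-entropy over the 14 classes
            loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits)
        return logits, probabilities, loss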
Figure 7. Subsequent calls to train(), evaluate(), or predict(). Source [7]

Variable Scope: When building the custom model, it is important to place the variables under the same variable scope used by the checkpoints; otherwise, the system will output errors similar to “tensor batch_normalization/beta is not found in resnet_v2_imagenet_checkpoint”. Variable scopes allow you to control variable reuse when calling functions that implicitly create and use variables.
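A hedged sketch of the pattern (layer names are for illustration only): wrapping the layer construction in tf.variable_scope("resnet_model") makes the created variables' names start with the "resnet_model/" prefix that the official checkpoint expects:

    import tensorflow as tf  # TensorFlow 1.x

    def build_backbone(inputs):
        # Variables created inside this scope get names prefixed with 'resnet_model/',
        # which is what the official pre-trained checkpoint expects
        with tf.variable_scope('resnet_model'):
            net = tf.layers.conv2d(inputs, filters=64, kernel_size=7, strides=2,
                                   padding='same', name='conv1')
            net = tf.layers.batch_normalization(net, name='batch_normalization')
        return net

    inputs = tf.placeholder(tf.float32, [None, 224, 224, 3])
    features = build_backbone(inputs)
    print([v.name for v in tf.global_variables()][:3])
    # e.g. ['resnet_model/conv1/kernel:0', 'resnet_model/conv1/bias:0', ...]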
We need to provide the export_outputs argument to the EstimatorSpec; it defines the signatures for TensorFlow Serving:

prediction = {
    'categories': tf.argmax(logits, axis=1, name='categories'),
    'scores': tf.sigmoid(logits, name='chexnet_sigmoid_tensor')
}

if mode == tf.estimator.ModeKeys.PREDICT:
    # Return the predictions and the specification for serving a SavedModel
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=prediction,
        export_outputs={
            'predict': tf.estimator.export.PredictOutput(prediction)
        })
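To complete the picture, a minimal sketch (the export path and the 224x224x3 input shape are assumptions) of exporting the trained Estimator as a SavedModel so the 'predict' signature above can be served or later frozen:

    import tensorflow as tf  # TensorFlow 1.x

    def serving_input_receiver_fn():
        # Name matches the input node used later by the inference scripts
        features = tf.placeholder(tf.float32, shape=[None, 224, 224, 3], name='input_tensor')
        return tf.estimator.export.ServingInputReceiver(features, {'input_tensor': features})

    # chexnet_classifier is the Estimator created earlier in this section
    saved_model_dir = chexnet_classifier.export_savedmodel(
        export_dir_base='/tmp/chexnet_saved_model',   # illustrative path
        serving_input_receiver_fn=serving_input_receiver_fn)
    print(saved_model_dir)  # e.g. b'/tmp/chexnet_saved_model/1541777429'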
    input_fn=lambda: input_fn(
        True, FLAGS.data_dir, FLAGS.batch_size, FLAGS.epochs_per_eval))

Evaluate the model and print results:

eval_results = chexnet_classifier.evaluate(
    input_fn=lambda: input_fn(False, FLAGS.data_dir, FLAGS.batch_size))
lr = reduce_lr_hook.update_lr(eval_results['loss'])
print(eval_results)
graph = tf.Graph()
with tf.Session(graph=graph) as sess:
    tf.saved_model.loader.load(sess, meta_graph.meta_info_def.tags, savedmodel_dir)
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, graph.as_graph_def(),
        output_node_names=["chexnet_sigmoid_tensor", "categories"])
    # Remove the unnecessary training nodes
    cleaned_frozen_graph = tf.graph_util.remove_training_nodes(frozen_graph_def)
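Once the training nodes are removed, the frozen GraphDef can be serialized to a .pb file for the later TF-TRT and UFF steps; a minimal sketch (the output path is illustrative):

    import tensorflow as tf  # TensorFlow 1.x

    frozen_graph_path = '/tmp/chexnet_frozen_graph.pb'  # illustrative path
    with tf.gfile.GFile(frozen_graph_path, 'wb') as f:
        f.write(cleaned_frozen_graph.SerializeToString())
    print('Frozen graph written to', frozen_graph_path)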
4 Inference with TensorRT™
NVIDIA TensorRT™ is a C++-based library for high-performance inference on GPUs. After a model is trained and saved, TensorRT™ transforms it by applying graph optimizations and layer fusion for a faster implementation, so it can be deployed in an inference context. TensorRT™ provides three tools to optimize models for inference: the TensorFlow-TensorRT integration (TF-TRT), the TensorRT C++ API, and the TensorRT Python API.
Further, the model needs to be built with operations supported by the TF-TRT integration; otherwise, the system will output errors for unsupported operations. See the reference list for further description [13].

Figure 8: Workflow for Creating a TensorRT Inference Graph from a TensorFlow Model in Frozen Graph Format

Import the TensorFlow-TensorRT integration library:

import tensorflow.contrib.tensorrt as trt
output_names = [node.split(":")[0] for node in outputs]
graph = tf.Graph()
with tf.Session(graph=graph) as sess:
    tf.saved_model.loader.load(
        sess, meta_graph.meta_info_def.tags, savedmodel_dir)
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, graph.as_graph_def(), output_node_names=output_names)
        max_batch_size=batch_size,
        max_workspace_size_bytes=workspace_size << 20,
        precision_mode=precision_mode)
    write_graph_to_file(graph_name, trt_graph, output_dir)
    return trt_graph

Create and save a GraphDef for TensorRT™ inference using the TensorRT™ library (optional, int8 only):

def get_trt_graph_from_calib(graph_name, calib_graph_def, output_dir):
    """Convert a TensorRT graph used for calibration to an inference graph."""
    trt_graph = trt.calib_graph_to_infer_graph(calib_graph_def)
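Once a TensorRT™-optimized GraphDef has been produced, it can be imported into a plain TensorFlow session and executed; a hedged sketch (image preprocessing is omitted and the tensor names follow the input/output nodes used in the command line below):

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x

    def run_trt_inference(trt_graph_def, batch):
        graph = tf.Graph()
        with graph.as_default():
            # Map the optimized GraphDef back into a graph we can execute
            tf.import_graph_def(trt_graph_def, name='')
            input_tensor = graph.get_tensor_by_name('input_tensor:0')
            output_tensor = graph.get_tensor_by_name('chexnet_sigmoid_tensor:0')
        with tf.Session(graph=graph) as sess:
            return sess.run(output_tensor, feed_dict={input_tensor: batch})

    # One 224x224 RGB image worth of dummy, already-preprocessed data
    scores = run_trt_inference(trt_graph, np.zeros([1, 224, 224, 3], dtype=np.float32))
    print(scores.shape)  # (1, 14): one sigmoid score per pathology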
    --image_file=image.jpg \
    --int8 \
    --output_dir=/home/chest-x-ray/output_tensorrt_chexnet_1541777429/ \
    --batch_size=1 \
    --input_node="input_tensor" \
    --output_node="chexnet_sigmoid_tensor"

Where:
--savedmodel_dir: the location of a saved model directory to be converted into a frozen graph
--image_file: the location of a JPEG image that will be passed in for inference
--int8: benchmark the model with TensorRT™ using int8 precision
--output_dir: the location where output files will be saved
--batch_size: batch size used for inference
Files used for development:
Script: tensorrt_chexnet.py
Base model script: tensorrt.py
Labels file: labellist_chest_x_ray.json

4.2 TensorRT™ using the TensorRT C++ API
In this section, we present how to run optimized inference with an existing TensorFlow model using the TensorRT C++ API.
Converting a Frozen Graph to UFF: An existing model built with TensorFlow can be used to build a TensorRT™ engine. Importing from the TensorFlow framework requires converting the TensorFlow model into the intermediate UFF format. To do so, we used the tool convert_to_uff.py located in the directory /usr/lib/python3.5/dist-packages/uff/bin, which takes a frozen model as input. Below is the command to convert a .pb TensorFlow frozen graph to a .uff format file:

python3 convert_to_uff.py \
    input_file /home/che
the best possible performance; if not, it is recommended to convert it to CHW. Overall, CHW is generally better for GPUs, while HWC is generally better for CPUs [6].

Build the optimized runtime engine in fp16 or int8 mode (with optional calibration for int8 inference) [15]:

// Configure the builder
builder->setMaxBatchSize(gParams.batchSize);
builder->setMaxWorkspaceSize(gParams.workspaceSize << 20);

// To run in fp16 mode
if (gParams.fp16) {
    builder->setFp16Mode(gParams.
sample input. TensorRT™ then performs inference in fp32 and gathers statistics about intermediate activation layers, which it uses to build the reduced-precision int8 engine. When the engine is built, TensorRT™ makes copies of the weights. The TensorRT™ network definition contains pointers to the model weights, and the builder copies the weights into the optimized engine; the parser owns the memory occupied by the weights, so the parser object should not be deleted until after the builder has run.
int outputIndex = engine.getBindingIndex("chexnet_sigmoid_tensor");

// 3 - Set up a buffer array pointing to the input and output buffers on the GPU, using the indexes:
void* buffers[2];
buffers[inputIndex] = inputBuffer;
buffers[outputIndex] = outputBuffer;

// 4 - TensorRT™ execution is typically asynchronous, so enqueue the kernels on a CUDA stream:
context.enqueue(batchSize, buffers, stream, nullptr);

Command line to execute the trtexec file:
Script output sample: on completion, the script prints overall metrics and timing information over the inference session:

Average over 100 runs is 1.44041 ms (host walltime is 1.56217 ms, 99% percentile time is 1.52326).
Average over 100 runs is 1.43143 ms (host walltime is 1.
5 Results

5.1 CheXNet Inference - Native TensorFlow FP32 with CPU Only
Benchmarks were run with batch sizes 1-32 using native TensorFlow FP32 with CPU only (AMD EPYC 7551 32-Core Processor). Tests were conducted using the docker image TensorFlow:1.10.0-py3.

Figure 10: CheXNet Inference - Native TensorFlow FP32 with CPU Only. AMD EPYC 7551 32-Core

Command line to execute the benchmark:

python3 tensorrt_chest.py --savedmodel_dir=/home/dell/chest-x-ray/chexnet_saved_model/1541777429/ \
    --image_file=image.
Where:
--native: benchmark the model with its native FP32 precision and without TensorRT™.

Script output sample:
==========================
network: native_frozen_graph.pb, batchsize 1, steps 100
  fps median: 9.2, mean: 9.1, uncertainty: 0.1, jitter: 0.3
  latency median: 0.10912, mean: 0.11459, 99th_p: 0.23157, 99th_uncertainty: 0.18079
==========================

• Throughput (images/sec): ~9
• Latency (sec): ~0.11

5.2 CheXNet Inference - Native TensorFlow FP32 with GPU
Command line to execute the benchmark:

python3 tensorrt_chest.py --savedmodel_dir=/home/dell/chest-x-ray/chexnet_saved_model/1541777429/ \
    --image_file=image.jpg \
    --native \
    --output_dir=/home/dell/chest-x-ray/output_tensorrt_chexnet_1541777429/ \
    --batch_size=1

Docker image for TensorFlow-GPU: nvcr.io/nvidia/tensorflow:18.10-py3

Where:
--native: benchmark the model with its native FP32 precision and without TensorRT™.

Script output sample:
==========================
network: native_frozen_graph.
Figure 12. CheXNet Inference - TF-TRT 5.0 Integration in INT8 Precision Mode

Command line to execute the benchmark:

python3 tensorrt_chest.py --savedmodel_dir=/home/dell/chest-x-ray/chexnet_saved_model/1541777429/ \
    --image_file=image.jpg \
    --int8 \
    --output_dir=/home/dell/chest-x-ray/output_tensorrt_chexnet_1541777429/ \
    --batch_size=1

Docker image for TensorFlow-GPU: nvcr.io/nvidia/tensorflow:18.
• Throughput (images/sec): ~315
• Latency (ms): ~3

5.4 Benchmarking CheXNet Model Inference with the Official ResnetV2_50
To benchmark our custom CheXNet model against a well-known model, we replicated the same TF-TRT-INT8 integration inference tests using the official pre-trained ResNet50 v2 model (fp32, accuracy 76.47%) [6].
Figure 14. Latency: CheXNet TF-TRT-INT8 versus ResnetV2_50 TF-TRT-INT8 Inference

5.5 CheXNet Inference - Native TensorFlow FP32 with GPU versus TF-TRT 5.0 INT8
After confirming that our custom model performed well compared to the TF-TRT optimized inference of an official model, in this section we compare the CheXNet inference model itself in different configurations. Figure 15 shows the throughput for each configuration.
Figure 15. Throughput: Native TensorFlow FP32 versus TF-TRT 5.0 Integration INT8

Figure 16 shows the latency curve for each inference configuration; the lower the latency, the better the performance, and in this case the TF-TRT-INT8 implementation reached the lowest inference time for all batch sizes.
Figure 16. Latency: Native TensorFlow FP32 (CPU / GPU) versus TF-TRT 5.0 Integration INT8
Table 5 shows the consolidated results of CheXNet inference in native TensorFlow FP32 mode versus the TF-TRT 5.0 integration in INT8, in terms of throughput and latency. We observed a large difference when running the test in the different configurations. For the speedup factors, see the next tables.

Table 5. Throughput and Latency: Native TensorFlow FP32 versus TF-TRT 5.0 Integration INT8
Figure 17 shows the R7425-T4-16GB speedup factor with TF-TRT versus native TensorFlow.

Figure 17: Speedup Factor with TF-TRT versus Native TensorFlow

5.6 CheXNet Inference - TF-TRT 5.0 Integration vs Native TRT5 C++ API
We wanted to explore further and optimize the CheXNet inference using the TensorRT C++ API with the sample tool trtexec provided by Nvidia. This sample is very useful for generating serialized engines and can be used as a template for working with custom models.
Figure 18: Throughput: TF-TRT 5.0 Integration vs Native TRT5 C++ API

Figure 19: Latency: TF-TRT 5.0 Integration vs Native TRT5 C++ API
Command line to execute the Native TensorRT™ C++ API benchmark: ./trtexec --uff=/home/dell/chest-x-ray/output_convert_to_uff/chexnet_frozen_graph_1541777429.
Table 8. Throughput with TensorRT™ at ~7 ms Latency Target

Inference Mode            | Batch Size | Throughput (img/sec) | Latency (ms)
TensorFlow-FP32-CPU Only  | 1          | 9                    | 114.9*
TensorFlow-FP32-GPU       | 1          | 142                  | 7.1
TF-TRT5 Integration FP32  | 2          | 272                  | 7.6
TF-TRT5 Integration FP16  | 4          | 656                  | 6.3
TF-TRT5 Integration INT8  | 8          | 1281                 | 6.6
TensorRT™ C++ API INT8    | 8          | 1371                 | 5.8

Figure 20.
• Using TF-TRT-FP32 with TensorRT™ (batch size = 2) instead of native TensorFlow FP32 without TensorRT™ improved throughput by ~92% (272 vs 142 img/sec) at the ~7 ms latency target.
• Using TF-TRT-FP16 with TensorRT™ (batch size = 4) improved throughput by ~362% (656 vs 142 img/sec). It also decreased latency by ~11%, to 6.3 ms from 7.1 ms.
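As a quick sanity check of those percentages (recomputed from the Table 8 figures):

    # Throughput (img/sec) at the ~7 ms latency target, from Table 8
    baseline_gpu_fp32 = 142   # Native TensorFlow FP32, GPU, batch size 1
    tftrt_fp32 = 272          # TF-TRT FP32, batch size 2
    tftrt_fp16 = 656          # TF-TRT FP16, batch size 4

    print('TF-TRT FP32 gain: {:.0f}%'.format((tftrt_fp32 / baseline_gpu_fp32 - 1) * 100))  # ~92%
    print('TF-TRT FP16 gain: {:.0f}%'.format((tftrt_fp16 / baseline_gpu_fp32 - 1) * 100))  # ~362%
    print('Latency reduction: {:.0f}%'.format((1 - 6.3 / 7.1) * 100))                      # ~11%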
6 Conclusion and Future Work
• Dell EMC offers an excellent solution with its PowerEdge R7425 server based on the Nvidia T4 GPU to accelerate artificial intelligence workloads, including high-performance deep learning inference boosted with the Nvidia TensorRT™ library.
• Native TensorFlow fp32 (without TensorRT™) inference on the PowerEdge R7425-T4-16GB server ran ~16X faster than on the CPU only (AMD EPYC 7551 32-Core Processor).
A Troubleshooting
In this section we describe the main issues we faced implementing the custom model CheXNet with Nvidia TensorRT™ and how we solved them:
• TensorRT™ installation. For the TF-TRT integration, we recommend working with the docker image nvcr.io/nvidia/tensorflow:<version>-py3. For native TensorRT™, we recommend working with the docker image nvcr.io/nvidia/tensorrt:<version>-py3.
• Python path to TF models.
variable scope as it was in the restored checkpoints. Solution: we customized the official TensorFlow base script resnet_model.py and placed the variables in the same variable scope name, “resnet_model”, used in the official checkpoints downloaded previously. To do so, we added the code line with tf.variable_scope("resnet_model"): in the model function. For more information, see “What's the difference between a name scope and a variable scope in TensorFlow” [8].
• TensorFlow Serving for Inference.
B References [1] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya et al., “CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning,” arXiv preprint arXiv:1711.05225, 2017 [Online]. Available: https://arxiv.org/abs/1711.05225 [2] Nvidia News Center, ”NVIDIA AI Inference Performance Milestones: Delivering Leading Throughput, Latency and Efficiency” [Online]. Available: https://news.developer.nvidia.
[14] TensorFlow Guide, “A Tool Developer's Guide to TensorFlow Model Files”. [Online]. Available: https://www.tensorflow.org/guide/extend/model_files#freezing
[15] Nvidia, “Working with TensorRT Using The C++ API, TensorRT Developer Guide”. [Online]. Available: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#c_topics
[16] Nvidia, “Importing a TensorFlow Model Using The C++ UFF Parser API”. [Online]. Available: https://docs.nvidia.
C Appendix - PowerEdge R7425 – GPU Features

Server: R7425-T4
CPU model: AMD EPYC 7551 32-Core Processor
GPU model: Tesla T4-16GB
GPU Architecture: NVIDIA Turing
Attached GPUs: 6
Driver Version: 410.79
Compute Capability: 7.5
Multiprocessors (MP): 40
CUDA Cores/MP: 64
CUDA Cores: 2,560
Clock Rate (GHz): 1.