Table 8. Throughput with TensorRT™ at ~7ms Latency Target
| Inference Mode | Batch Size | Throughput (img/sec) | Latency (ms) |
|---|---|---|---|
| TensorFlow-FP32-CPU Only | 1 | 9 | 114.9* |
| TensorFlow-FP32-GPU | 1 | 142 | 7.1 |
| TF-TRT5 Integration FP32 | 2 | 272 | 7.6 |
| TF-TRT5 Integration FP16 | 4 | 656 | 6.3 |
| TF-TRT5 Integration INT8 | 8 | 1281 | 6.6 |
| TensorRT™ C++ API INT8 | 8 | 1371 | 5.8 |
Figure 20. Throughput with TensorRT™ at ~7ms Latency Target
From Table 8 and Figure 20 above, we can observe:

- Native TensorFlow FP32 inference without TensorRT™ (batch size = 1) run on CPU only (AMD EPYC 7551 32-Core Processor) delivered 9 img/sec at a minimum latency of ~115 ms. This is a reference measurement that shows the difference between CPU-only and GPU-based systems.
- The same native TensorFlow FP32 inference without TensorRT™ (batch size = 1) run on the GPU delivered 142 img/sec at the ~7 ms latency target, about 16X faster than CPU only (142 vs. 9 img/sec). We use this configuration as the baseline for benchmarking the TensorRT™-optimized inference modes; a conversion-and-timing sketch follows below.
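To make the workflow behind the TF-TRT5 Integration rows concrete, the sketch below shows one plausible way to convert and time the model with the TensorFlow 1.x contrib integration for TensorRT 5: the frozen graph is rewritten with trt.create_inference_graph() at a chosen precision mode, then run over repeated batches to derive latency and throughput. This is a minimal sketch, not the paper's actual benchmark harness; the file name chexnet_frozen.pb and the node names input_1 and dense/Sigmoid are hypothetical placeholders.

```python
# Minimal TF-TRT FP16 conversion and timing sketch (TensorFlow 1.x with
# the TensorRT 5 contrib integration). Paths and node names are assumptions.
import time
import numpy as np
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

BATCH = 4  # batch size used for the FP16 row in Table 8

# Load a frozen CheXNet (DenseNet-121) graph; the path is hypothetical.
with tf.gfile.GFile("chexnet_frozen.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Replace TensorRT-compatible subgraphs with TRT engine ops.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["dense/Sigmoid"],          # hypothetical output node
    max_batch_size=BATCH,
    max_workspace_size_bytes=1 << 30,   # 1 GB workspace for TensorRT
    precision_mode="FP16")              # or "FP32" / "INT8"

batch = np.random.rand(BATCH, 224, 224, 3).astype(np.float32)

with tf.Graph().as_default():
    tf.import_graph_def(trt_graph, name="")
    with tf.Session() as sess:
        inp = sess.graph.get_tensor_by_name("input_1:0")        # hypothetical
        out = sess.graph.get_tensor_by_name("dense/Sigmoid:0")  # hypothetical
        for _ in range(20):             # warm-up runs (engine build, caches)
            sess.run(out, feed_dict={inp: batch})
        iters = 100
        t0 = time.time()
        for _ in range(iters):          # timed runs
            sess.run(out, feed_dict={inp: batch})
        avg = (time.time() - t0) / iters
        print("latency: %.1f ms, throughput: %.0f img/sec"
              % (avg * 1e3, BATCH / avg))
```

For the INT8 rows, the contrib integration additionally requires a calibration pass over representative data (trt.calib_graph_to_infer_graph) before the inference graph is finalized.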