Table 8. Throughput with TensorRT™ at ~7ms Latency Target
| Inference Mode | Batch Size | Throughput (img/sec) | Latency (ms) |
|---|---|---|---|
| TensorFlow-FP32-CPU Only | 1 | 9 | 114.9* |
| TensorFlow-FP32-GPU | 1 | 142 | 7.1 |
| TF-TRT5 Integration FP32 | 2 | 272 | 7.6 |
| TF-TRT5 Integration FP16 | 4 | 656 | 6.3 |
| TF-TRT5 Integration INT8 | 8 | 1281 | 6.6 |
| TensorRT™ C++ API INT8 | 8 | 1371 | 5.8 |
Figure 20. Throughput with TensorRT™ at ~7ms Latency Target
From Table 8 and Figure 20 above, we can observe:

- Native TensorFlow FP32 inference without TensorRT™ (batch size = 1) run on CPU only (AMD EPYC 7551 32-Core Processor) delivered 9 img/sec at a minimum latency of ~115 ms. This is a reference measurement that shows the difference between CPU-only and GPU-based systems.
- The same native TensorFlow FP32 inference without TensorRT™ (batch size = 1) run on the GPU delivered 142 img/sec at the ~7 ms latency target, about 16X faster than CPU only (142 vs. 9 img/sec). We use this configuration as the baseline for benchmarking the TensorRT™-optimized inference modes; a conversion-and-timing sketch follows below.
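To make the workflow behind the TF-TRT5 Integration rows concrete, the sketch below shows one plausible way to convert and time the model with the TensorFlow 1.x contrib integration for TensorRT 5: the frozen graph is rewritten with trt.create_inference_graph() at a chosen precision mode, then run over repeated batches to derive latency and throughput. This is a minimal sketch, not the paper's actual benchmark harness; the file name chexnet_frozen.pb and the node names input_1 and dense/Sigmoid are hypothetical placeholders.

```python
# Minimal TF-TRT FP16 conversion and timing sketch (TensorFlow 1.x with
# the TensorRT 5 contrib integration). Paths and node names are assumptions.
import time
import numpy as np
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

BATCH = 4  # batch size used for the FP16 row in Table 8

# Load a frozen CheXNet (DenseNet-121) graph; the path is hypothetical.
with tf.gfile.GFile("chexnet_frozen.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Replace TensorRT-compatible subgraphs with TRT engine ops.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["dense/Sigmoid"],          # hypothetical output node
    max_batch_size=BATCH,
    max_workspace_size_bytes=1 << 30,   # 1 GB workspace for TensorRT
    precision_mode="FP16")              # or "FP32" / "INT8"

batch = np.random.rand(BATCH, 224, 224, 3).astype(np.float32)

with tf.Graph().as_default():
    tf.import_graph_def(trt_graph, name="")
    with tf.Session() as sess:
        inp = sess.graph.get_tensor_by_name("input_1:0")        # hypothetical
        out = sess.graph.get_tensor_by_name("dense/Sigmoid:0")  # hypothetical
        for _ in range(20):             # warm-up runs (engine build, caches)
            sess.run(out, feed_dict={inp: batch})
        iters = 100
        t0 = time.time()
        for _ in range(iters):          # timed runs
            sess.run(out, feed_dict={inp: batch})
        avg = (time.time() - t0) / iters
        print("latency: %.1f ms, throughput: %.0f img/sec"
              % (avg * 1e3, BATCH / avg))
```

For the INT8 rows, the contrib integration additionally requires a calibration pass over representative data (trt.calib_graph_to_infer_graph) before the inference graph is finalized.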