Whitepaper

Addressing the Memory Bottleneck in AI Model Training for Healthcare

Key Takeaways
▪ Near-terabyte memory footprint in 3D model training
▪ 3.4x speedup with Deep Neural Network Library optimizations

Executive Summary
Intel, Dell, and researchers at the University of Florida have collaborated to help data scientists optimize the analysis of healthcare data sets using artificial intelligence (AI).
Revisions
Date | Description
1/9/2020 | Initial release

Acknowledgements
This paper was produced by the following:
Bhavesh Patel, Dell EMC
David Ojika, PhD, University of Florida
G. Anthony Reina, MD, Intel
Trent Boyer, Intel
Chad Martin, Intel
Prashant Shah, Intel
Table of Contents
Motivation
Multimodal Brain Tumor Analysis
Computing Challenges
Experimental Data
3D U-Net Model
Training 3D U-Net on a Large-Memory System
Conclusions
References
Appendix: Reproducibility
Motivation
Healthcare data sets often consist of large, multi-dimensional modalities. Deep learning (DL) models developed from these data sets require both high accuracy and high confidence levels to be useful in clinical practice. Researchers employ advanced hardware and software to speed up this data- and computation-intensive process.
By replacing current assessments with highly accurate and reproducible measurements, AI and DL techniques can automatically analyze brain tumor scans, offering enormous potential for improved diagnosis, treatment planning, and patient follow-up. A typical MRI scan of the brain may contain 4D volumes with multimodal, multisite MRI data (FLAIR, T1w, T1gd, T2w).
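To make the "4D volume" idea concrete, the sketch below assembles one such multimodal sample from four per-modality NIfTI files using the nibabel library. The file names are hypothetical placeholders, and the native BraTS volume shape (240x240x155) is assumed; this is an illustration of the data layout, not the paper's preprocessing pipeline.

```python
# A minimal sketch of building one multimodal MRI sample: four modalities
# (FLAIR, T1w, T1gd, T2w) stacked into a single 4D, channels-last tensor.
# File paths are hypothetical; BraTS data ships as NIfTI (.nii.gz) files.
import numpy as np
import nibabel as nib

modality_files = {
    "FLAIR": "BraTS_subject001_flair.nii.gz",  # hypothetical paths
    "T1w":   "BraTS_subject001_t1.nii.gz",
    "T1gd":  "BraTS_subject001_t1ce.nii.gz",
    "T2w":   "BraTS_subject001_t2.nii.gz",
}

# Each modality is a 3D volume (240x240x155 in native BraTS space).
volumes = [nib.load(path).get_fdata(dtype=np.float32)
           for path in modality_files.values()]

# Stack along a trailing channel axis, giving shape (240, 240, 155, 4):
# one channel per modality.
sample = np.stack(volumes, axis=-1)
print(sample.shape)
```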
To fit training within available memory, researchers commonly resort to the following workarounds:
• Image size: Images are often downsampled to a lower resolution
• Batch size: Batch sizes are often reduced to one or two images
• Tiling/Patching: Images are often subsampled into overlapping tiles/patches (see the patch-extraction sketch after this list)
• Model Complexity: Reductions in the number of feature maps and/or layers are often necessary
• Model Parallelism: Models may be distributed across several compute nodes in a parallel fashion
Although these tricks have been used to produce clinically relevant models, we believe that researchers should not have to trade away image resolution, batch size, or model capacity simply to fit training into memory.
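As referenced in the list above, the following sketch illustrates the tiling/patching workaround: a full 3D volume is cut into overlapping sub-volumes so that each training step only has to hold a small patch in memory. It assumes only numpy, and the patch and stride sizes are illustrative choices, not values used in this paper.

```python
# Illustrative patch extraction for a channels-last 3D volume (D, H, W, C).
# Overlap comes from using a stride smaller than the patch size.
import numpy as np

def extract_patches(volume, patch=(128, 128, 128), stride=(112, 112, 112)):
    """Yield overlapping 3D patches from a 4D channels-last volume."""
    dz, dy, dx = patch
    sz, sy, sx = stride
    D, H, W, _ = volume.shape
    for z in range(0, max(D - dz, 0) + 1, sz):
        for y in range(0, max(H - dy, 0) + 1, sy):
            for x in range(0, max(W - dx, 0) + 1, sx):
                yield volume[z:z + dz, y:y + dy, x:x + dx, :]

# Example: a 240x240x144 volume with 4 modality channels yields four
# overlapping 128x128x128 patches with these settings.
vol = np.zeros((240, 240, 144, 4), dtype=np.float32)
patches = list(extract_patches(vol))
print(len(patches), patches[0].shape)
```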
3D U-Net Model
Convolutional neural networks (CNNs) such as U-Net have been widely successful for 2D segmentation in computer vision problems [6]. However, most medical data used in clinical practice consists of 3D volumes. Since only 2D slices can be displayed on a computer screen, annotating these large volumes with segmentation labels in a slice-by-slice manner is cumbersome and inefficient.
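The 3D U-Net extends the U-Net idea by replacing 2D convolutions and pooling with their 3D counterparts, so the network segments whole volumes rather than individual slices. The sketch below is a deliberately reduced, illustrative version in Keras (channels last, matching the appendix); the filter counts, depth, and input shape are assumptions chosen to keep the example short, and it is not the exact ~5.65M-parameter network benchmarked in this paper.

```python
# A tiny, illustrative 3D U-Net: one encoder level, a bottleneck, and one
# decoder level with a skip connection. Not the paper's exact architecture.
from keras.layers import Input, Conv3D, MaxPooling3D, UpSampling3D, concatenate
from keras.models import Model

def tiny_unet3d(input_shape=(144, 144, 144, 4), n_classes=3):
    inputs = Input(shape=input_shape)

    # Encoder: two 3x3x3 convolutions, then downsample by 2 in each dimension.
    e1 = Conv3D(16, (3, 3, 3), activation="relu", padding="same")(inputs)
    e1 = Conv3D(16, (3, 3, 3), activation="relu", padding="same")(e1)
    p1 = MaxPooling3D(pool_size=(2, 2, 2))(e1)

    # Bottleneck at half resolution with twice the filters.
    b = Conv3D(32, (3, 3, 3), activation="relu", padding="same")(p1)
    b = Conv3D(32, (3, 3, 3), activation="relu", padding="same")(b)

    # Decoder: upsample and concatenate encoder feature maps (skip connection).
    u1 = UpSampling3D(size=(2, 2, 2))(b)
    u1 = concatenate([u1, e1])
    d1 = Conv3D(16, (3, 3, 3), activation="relu", padding="same")(u1)
    d1 = Conv3D(16, (3, 3, 3), activation="relu", padding="same")(d1)

    # Per-voxel class probabilities (sigmoid for overlapping tumor sub-regions).
    outputs = Conv3D(n_classes, (1, 1, 1), activation="sigmoid")(d1)
    return Model(inputs=inputs, outputs=outputs)

model = tiny_unet3d()
model.summary()
```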
Table 1. Memory requirement for training 3D U-Net.

Image size | Batch size | Training outcome | Server system memory | Server CPU family | Server tag
128x128x128 | 16 | Fail | 192 GB | 1st Generation Intel Xeon Scalable Processor | Dev server
144x144x144 | 8 | Success | 384 GB | 1st Generation Intel Xeon Scalable Processor | Standard server
240x240x144 | 16 | - | 1.5 TB | 2nd Generation Intel Xeon Scalable Processor | Memory-rich server

Table 2. Provisioning training infrastructure for 3D U-Net.
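The exact accounting behind the pre-calculated memory requirement is not reproduced here, but a back-of-the-envelope estimate makes the scaling intuitive: activation storage grows with image volume, batch size, and feature-map width. The sketch below uses assumed layer widths (a 4-level U-Net with a base of 16 filters, fp32, a rough 2x factor for backpropagation) and will not reproduce the exact figures in the table; framework overhead, gradients, and intermediate tensors multiply the real footprint further.

```python
# Rough, illustrative activation-memory estimate for a 3D U-Net.
# All constants below are assumptions, not the paper's exact accounting.
def unet3d_activation_gb(image_size, batch_size, base_filters=16, levels=4,
                         bytes_per_value=4, train_factor=2.0):
    """Approximate fp32 activation footprint in GB (train_factor ~2 for backprop)."""
    d, h, w = image_size
    total_values = 0
    for level in range(levels):
        voxels = (d >> level) * (h >> level) * (w >> level)
        filters = base_filters * (2 ** level)
        # Two convolution outputs per level on the encoder and decoder paths.
        total_values += 4 * voxels * filters
    total_bytes = total_values * batch_size * bytes_per_value * train_factor
    return total_bytes / 1e9

for size, batch in [((128, 128, 128), 16), ((144, 144, 144), 8), ((240, 240, 144), 16)]:
    print(size, batch, round(unet3d_activation_gb(size, batch), 1), "GB")
```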
We overcame this memory bottleneck on our development server by reducing the training batch size from 16 down to 2 and shrinking the images well below the full-scale dimensions (240x240x144). Of course, this has an impact on model accuracy and convergence time. Next, we upgraded the server's system memory to its maximum supported capacity (384 GB), increased the image size to roughly half of full scale, and halved the batch size.
Training 3D U-Net on a Large-Memory System
A single-node server with large memory has the potential to reduce an organization's total cost of ownership (TCO) while addressing the memory bottleneck involved with training large models on complex datasets. Using a 4-socket 2nd Generation Intel Xeon Scalable Processor system on a Dell EMC PowerEdge server equipped with 1.5 TB of memory, we trained the 3D U-Net on full-resolution (240x240x144) images.
With Intel Deep Neural Network Library (DNNL) optimizations enabled, training time was 30 seconds per image, a 3.4x speedup (Figure 6) compared to stock TensorFlow (without DNNL) at the same training batch size of 16. Figure 7 depicts the prediction performance of the trained model. As observed, the segmentation mask from the model predictions closely matches the ground truth mask.
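The 3.4x figure above refers to running the model on a DNNL (formerly MKL-DNN) build of TensorFlow. As a rough illustration only, the snippet below shows the kind of threading and affinity settings typically paired with such a build on multi-socket Xeon servers; the specific thread counts are generic starting points, not the configuration benchmarked for this result.

```python
# Illustrative CPU threading setup for a DNNL/MKL-DNN build of TensorFlow 1.x.
# Values are assumptions; tune to the physical core count of the target server.
import os
import tensorflow as tf
import keras.backend as K

# OpenMP/affinity hints read by the DNNL primitives at runtime.
os.environ["OMP_NUM_THREADS"] = "24"          # assumed physical cores per socket
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

# Map TensorFlow's thread pools onto the same core layout.
config = tf.ConfigProto(intra_op_parallelism_threads=24,
                        inter_op_parallelism_threads=2)
K.set_session(tf.Session(config=config))
```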
Conclusions
In this white paper, we presented multimodal brain tumor analysis for medical diagnosis, highlighted the computing challenges, and described the 3D U-Net model for the task of volumetric image segmentation. We pre-calculated the memory requirement of the model and analyzed three different server configurations with varying memory capacity, from a "dev server" with 192 GB of memory to a "memory-rich" server with over 1 TB of memory.
References
[1] S. Ghosh, N. Das, I. Das, and U. Maulik, "Understanding Deep Learning Techniques for Image Segmentation," ACM Computing Surveys, vol. 52, no. 4, Article 73, 35 pages, Aug. 2019.
[2] E. Holland, "Progenitor cells and glioma formation," Current Opinion in Neurology, vol. 14, no. 6, pp. 683–688.
[3] B. H. Menze et al., "The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)," IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, Oct. 2015.
Appendix: Reproducibility

Software:
• Keras 2.2
• TensorFlow 1.11
• DNNL
• Python 3.6
• Anaconda 3
• Conda 4.6
• Ubuntu 16.04

Data:
• Dataset name: BRATS
• Tensor image size: 4D
• Train, val, test images: 406, 32, 46
• Dataset license: CC-BY-SA 4.0
• Release: 2.0, 04/05/2018
• Dataset source: https://www.med.upenn.edu/sbia/brats2017.html

Model:
• Architecture: 3D U-Net
• Input format: Channels last
• Params: 5,650,801
• Trainable params: 5,647,857
• Non-trainable params: 2,944
• Code repository: https://github.
FTC Disclaimer: For Performance Claims Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.