
H18079
Technical White Paper
Dell EMC Isilon, PowerSwitch, and NVIDIA DGX-2 Systems for Deep Learning
The results of multiple industry standard image classification benchmarks using TensorFlow are included.
February 2020
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
This document may contain certain words that are not consistent with Dell's current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly.
This document may contain language from third party content that is not under Dell's control and is not consistent with Dell's current guidelines for Dell's own content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.
Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [3/9/2021] [Technical White Paper] [H18079]
Table of Contents
Revisions ................................................................................................................................................................................. 5
Executive summary ................................................................................................................................................................. 5
Audience ................................................................................................................................................................................. 5
Introduction .............................................................................................................................................................................. 5
Deep learning dataflow ........................................................................................................................................................... 5
Solution architecture ............................................................................................................................................................... 7
OVERVIEW ......................................................................................................................................................................... 7
STORAGE: DELL EMC ISILON F800 ................................................................................................................................. 7
Storage tiering .................................................................................................................................................................. 9
OneFS caching ................................................................................................................................................................ 9
Locks and concurrency .................................................................................................................................................... 9
NETWORKING: DELL EMC POWERSWITCH S5232F-ON SWITCH ............................................................................. 10
COMPUTE: NVIDIA DGX-2 SYSTEM ............................................................................................................................... 10
BILL OF MATERIALS ........................................................................................................................................................ 10
SOFTWARE VERSIONS ................................................................................................................................................... 11
Deep learning training performance and analysis ................................................................................................................. 11
BENCHMARK METHODOLOGY ...................................................................................................................................... 11
BENCHMARK RESULTS .................................................................................................................................................. 13
SYSTEM METRICS ........................................................................................................................................................... 14
MEASUREMENT OF NETWORK I/O BETWEEN DGX-2 SYSTEMS .............................................................................. 16
UNDERSTANDING FILE CACHING ................................................................................................................................. 18
UNDERSTANDING THE TRAINING PIPELINE ............................................................................................................... 18
NVIDIA COLLECTIVE COMMUNICATION LIBRARY (NCCL) ......................................................................................... 19
Storage-only performance ..................................................................................................................................................... 20
STORAGE NETWORK PERFORMANCE USING IPERF ................................................................................................ 20
STORAGE-ONLY PERFORMANCE USING FIO ............................................................................................................. 20
STORAGE-ONLY PERFORMANCE USING TENSORFLOW .......................................................................................... 20
Solution sizing guidance ....................................................................................................................................................... 23
Conclusions ........................................................................................................................................................................... 24
Acknowledgements ............................................................................................................................................................... 25
Appendix – System configuration ......................................................................................................................................... 26
ISILON ............................................................................................................................................................................... 26
Configuration .................................................................................................................................................................. 26
Configuring automatic storage tiering ............................................................................................................................ 26
Testing automatic storage tiering ................................................................................................................................... 28
DELL EMC POWERSWITCH S5232F-ON DATA SWITCHES ......................................................................................... 29
NVIDIA DGX-2 SYSTEM ................................................................................................................................................... 29
INSTALL AI BENCHMARK UTILITIES .............................................................................................................................. 31
ISILON VOLUME MOUNTING .......................................................................................................................................... 32
Appendix – Benchmark setup ............................................................................................................................................... 32
CREATING THE IMAGENET TFRECORD DATASETS ................................................................................................... 32
OBTAIN THE TENSORFLOW BENCHMARKS ................................................................................................................ 32
START TENSORFLOW CONTAINERS ............................................................................................................................ 33
Appendix – Monitoring Isilon performance ............................................................................................................................ 34
INSIGHTIQ ........................................................................................................................................................................ 34
ISILON STATISTICS CLI .................................................................................................................................................. 34
Appendix – Isilon performance testing with iPerf and FIO .................................................................................................... 35
IPERF ................................................................................................................................................................................ 35
Using iPerf to test Isilon to DGX-2 system performance ............................................................................................... 35
Using iPerf to test DGX-2 system storage network performance .................................................................................. 36
FIO ..................................................................................................................................................................................... 36
Appendix – Isilon performance testing with TensorFlow ...................................................................................................... 37
Appendix – Switch configuration ........................................................................................................................................... 37
Appendix – Collecting system metrics with Prometheus and Grafana ................................................................................. 42
References ............................................................................................................................................................................ 44

Revisions

Date | Description | Author
February 2020 | Initial release | Claudio Fahey

Executive summary

Deep learning (DL) techniques have enabled great successes in many fields such as computer vision, natural language processing (NLP), gaming and autonomous driving by enabling a model to learn from existing data and then to make corresponding predictions. The success is due to a combination of improved algorithms, access to larger datasets and increased computational power. To be effective at enterprise scale, the computational intensity of DL requires highly powerful and efficient parallel architectures. The choice and design of the system components, carefully selected and tuned for DL use-cases, can have a big impact on the speed, accuracy and business value of implementing artificial intelligence (AI) techniques.
In such a complex environment, it is critical that organizations be able to rely on vendors that they trust. Over the last few years, Dell Technologies and NVIDIA have established a strong partnership to help organizations fast-track their AI initiatives. Our partnership is built on the philosophy of offering flexibility and informed choice across a broad portfolio which combines best of breed GPU accelerated compute, scale-out storage, and networking.
This paper focuses on how Dell EMC Isilon F800 all-flash scale-out NAS and Dell EMC PowerSwitch S5232F-ON switches accelerate AI innovation by delivering the performance, scalability and concurrency to complement the requirements of NVIDIA DGX-2™ systems for high performance AI workloads.

Audience

This document is intended for organizations interested in simplifying and accelerating DL solutions with advanced computing and scale-out data management solutions. Solution architects, system administrators and other interested readers within those organizations constitute the target audience.

Introduction

DL is an area of AI that uses artificial neural networks to enable computers to accurately recognize complex real-world patterns. These new levels of innovation have applicability across nearly every industry vertical. Some of the early adopters include advanced research, precision medicine, high tech manufacturing, advanced driver assistance systems (ADAS) and autonomous driving. Building on these initial successes, AI initiatives are springing up in various business units, such as manufacturing, customer support, life sciences, marketing, and sales. Gartner predicts that AI augmentation will generate $2.9 trillion in business value by 2021 alone. Organizations are faced with a multitude of complex choices related to data, analytic skill-sets, software stacks, analytic toolkits, and infrastructure components, each with significant implications for time to market and the value associated with these initiatives.
In such a complex environment, it is critical that organizations be able to rely on vendors that they trust. Over the last few years, Dell Technologies and NVIDIA have established a strong partnership to help organizations accelerate their AI initiatives. Our partnership is built on the philosophy of offering flexibility and informed choice across an extensive portfolio. Together, our technologies provide the foundation for successful AI solutions: advanced DL software frameworks, massively parallel compute in the form of NVIDIA GPUs for parallel model training, and scale-out file systems that support the concurrency, performance, and capacity requirements of unstructured image and video data sets.
This document focuses on the latest step in the Dell Technologies and NVIDIA collaboration, a new AI reference architecture with Isilon F800 storage, PowerSwitch S5232F-ON switches, and DGX-2 systems for DL workloads. This new offer gives customers more flexibility in how they deploy scalable, high performance DL. The results of multiple industry standard image classification benchmarks using TensorFlow are included.

Deep learning dataflow

As visualized in Figure 1, DL usually consists of two distinct workflows: model development and inference.
Figure 1: Common DL Workflows: Model development and inference
Note: The Isilon storage and DGX-2 system architecture is optimized for the model development workflow, which consists of the model training and batch inference validation steps. It is not intended for, nor was it benchmarked for, production inference.
The workflow steps are defined and detailed below.
1. Ingest Labeled Data: The labeled data (for example, images and their labels, which indicate whether each image contains a dog, cat, or horse) are ingested into the DL system.
2. Transform: Transformation includes all operations that are applied to the labeled data before they are passed to the DL algorithm. It is sometimes referred to as preprocessing. For images, this often includes file parsing, JPEG decoding, cropping, resizing, rotation, and color adjustments. Transformations can be performed on the entire dataset ahead of time, storing the transformed data on disk. Many transformations can also be applied in a training pipeline, avoiding the need to store the intermediate data (see the input-pipeline sketch following this list).
3. Train Model: The model parameters (edge weights) are learned from the labeled data using the stochastic gradient descent optimization method. In the case of image classification, there are several prebuilt structures of neural networks that have been shown to work well. To provide an example, Figure 2 shows the high-level structure of the Inception-v3 model, which contains nearly 25 million parameters that must be learned. In this diagram, images enter from the left and the probability of each class comes out on the right.
Figure 2: Inception-v3 model architecture
4. Validate Model: Once the model training phase completes with satisfactory accuracy, you measure the model's accuracy on validation data – data that the model training process has not seen. This is done by using the
trained model to make inferences from the validation data and comparing the result with the correct label. This is often referred to as inference but keep in mind that this is a distinct step from production inference.
5. Production Inference: The trained and validated model is then often deployed to a system that can perform real-time inference. It will accept as input a single image and output the predicted class (dog, cat, or horse). In some cases, inputs are batched for higher throughput at the cost of higher latency.
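To make the Transform step concrete, the following is a minimal TensorFlow input-pipeline sketch that reads TFRecord files and applies typical preprocessing inside the training pipeline. It is illustrative only and is not the pipeline used by the TensorFlow Benchmarks suite; the feature names, image size, and tuning values are assumptions.

import tensorflow as tf

def parse_and_preprocess(serialized_example):
    # Assumed feature names; a real dataset may use a different schema.
    features = tf.io.parse_single_example(serialized_example, {
        'image/encoded': tf.io.FixedLenFeature([], tf.string),
        'image/class/label': tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)   # JPEG decoding
    image = tf.image.resize_images(image, [224, 224])                     # resizing (TF 1.x API, matching Table 2)
    image = tf.image.random_flip_left_right(image)                        # simple augmentation
    return image, features['image/class/label']

def make_dataset(tfrecord_files, batch_size=256):
    dataset = tf.data.TFRecordDataset(tfrecord_files, num_parallel_reads=16)
    dataset = dataset.shuffle(10000)                                      # shuffle records in a buffer
    dataset = dataset.map(parse_and_preprocess, num_parallel_calls=8)
    return dataset.batch(batch_size).prefetch(2)

Running these transformations in the input pipeline, as described in step 2, avoids storing a second, preprocessed copy of the dataset on disk.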

Solution architecture

OVERVIEW
Figure 3 illustrates the reference architecture, showing the key components that made up the solution as it was tested and
benchmarked. Note that in a customer deployment, the number of DGX-2 systems and Isilon storage nodes will vary and can be scaled independently to meet the requirements of the specific DL workloads. Refer to the Solution sizing guidance section for details.
Figure 3: Reference Architecture – (8) Dell EMC Isilon F800 nodes in (2) Isilon chassis, (3) NVIDIA DGX-2 systems, and (2) Dell EMC PowerSwitch S5232F-ON data switches, connected by 40 GbE NFS, 100 GbE NFS, and 100 GbE RoCE links (back-end 40 GbE switches for Isilon not shown)
STORAGE: DELL EMC ISILON F800
Dell EMC Isilon F800 represents the sixth generation of hardware built to run the well-proven and massively scalable OneFS operating system. Each F800 chassis, shown in Figure 4, contains four storage nodes, 60 high-performance
solid-state drives (SSDs) and eight 40 GbE network connections. OneFS combines up to 252 nodes in 63 chassis into a single high-performance file system designed to handle the most intense I/O workloads such as DL. As performance and capacity demands increase, both can be scaled-out simply and non-disruptively, allowing applications and users to continue working.
Figure 4: Isilon F800 chassis, containing four storage nodes
In the solution tested in this document, eight F800 nodes, in two chassis, were used. Dell EMC Isilon F800 has the following features.
Low latency, high throughput, and massively parallel I/O for AI:
• Up to 250,000 file IOPS per chassis, up to 15.75 million IOPS per cluster
• Up to 15 GB/s throughput per chassis, up to 945 GB/s per cluster
• 96 TB to 924 TB raw flash capacity per chassis; up to 58 PB per cluster (all-flash)
This shortens time for training and testing analytical models for data sets from tens of TBs to tens of PBs on AI platforms such as RAPIDS, TensorFlow, SparkML, Caffe, or proprietary AI platforms.
The ability to run AI in-place on data using multi-protocol access:
• Multi-protocol support such as SMB, NFS, HTTP, and native HDFS to maximize operational flexibility
This eliminates the need to migrate/copy data and results over to a separate AI stack. Organizations can perform DL and run other IT apps on the same data already on Isilon by adding additional Isilon nodes to an existing cluster.
Enterprise grade features out-of-box:
• Enterprise data protection and resiliency
• Robust security options
This enables organizations to manage AI data lifecycle with minimal cost and risk, while protecting data and meeting regulatory requirements.
Extreme scale:
• Seamlessly tier between All Flash, Hybrid, and Archive nodes via SmartPools
• Grow-as-you-go scalability with up to 58 PB flash capacity per cluster
• New nodes can be added to a cluster simply by connecting power, back-end Ethernet and front-end Ethernet
• As new nodes are added, storage capacity, throughput, IOPS, cache, and CPU grow
• Up to 63 chassis (252 nodes) may be connected to form a single cluster with a single namespace and a single coherent cache
• Up to 85% storage efficiency to reduce costs
• Optional data de-dup and compression enabling up to a 3:1 data reduction
Organizations can achieve AI at scale in a cost-effective manner, enabling them to handle multi-petabyte datasets with high resolution content without re-architecture and/or performance degradation.
There are several key features of Isilon OneFS that make it an excellent storage system for DL workloads that require performance, concurrency, and scale. These features are detailed below.
Storage tiering
Dell EMC Isilon SmartPools software enables multiple levels of performance, protection, and storage density to co-exist
within the same file system and unlocks the ability to aggregate and consolidate a wide range of applications within a single extensible, ubiquitous storage resource pool. This helps provide granular performance optimization, workflow isolation, higher utilization, and independent scalability – all with a single point of management.
SmartPools allows you to define the value of the data within your workflows based on policies and automatically aligns data to the appropriate price/performance tier over time. Data movement is seamless, and with file-level granularity and control via automated policies, manual control, or API, you can tune performance and layout, storage tier alignment, and protection settings – all with minimal impact on your end users.
Storage tiering has a very convincing value proposition, namely separating data according to its business value and aligning it with the appropriate class of storage and levels of performance and protection. Information Lifecycle Management techniques have been around for several years but have typically suffered from the following inefficiencies: they are complex to install and manage, involve changes to the file system, and require the use of stub files, and so on.
Dell EMC Isilon SmartPools is a next generation approach to tiering that facilitates the management of heterogeneous clusters. The SmartPools capability is native to the Isilon OneFS scale-out file system, which allows for unprecedented flexibility, granularity, and ease of management. In order to achieve this, SmartPools leverages many of the components and attributes of OneFS, including data layout and mobility, protection, performance, scheduling and impact management.
A typical Isilon cluster will store multiple datasets with different performance, protection, and price requirements. Generally, files that have been recently created and accessed should be stored in a hot tier while files that have not been accessed recently should be stored in a cold tier. Because Isilon supports tiering based on a file’s access time, this can be performed automatically. For storage administrators that want more control, complex rules can be defined to set the storage tier based on a file’s path, size, or other attributes.
All files on Isilon are always immediately accessible (read and write) regardless of their storage tier and even while being moved between tiers. The file system path to a file is not changed by tiering. Storage tiering policies are applied, and files are moved by the Isilon SmartPools job, which runs daily at 22:00 by default.
For more details, see Storage Tiering with Dell EMC Isilon SmartPools.

OneFS caching
The OneFS caching infrastructure design is predicated on aggregating the cache present on each node in a cluster into one globally accessible pool of memory. This allows all the memory cache in a node to be available to every node in the cluster. Remote memory is accessed over an internal interconnect and has lower latency than accessing hard disk drives and SSDs.
For files marked with an access pattern of concurrent or streaming, OneFS can take advantage of prefetching of data based on heuristics used by the Isilon SmartRead component. This greatly improves sequential-read performance across all protocols and means that reads come directly from RAM within milliseconds. For high-sequential cases, SmartRead can very aggressively prefetch ahead, allowing reads of individual files at very high data rates.

For more details, see OneFS SmartFlash.

Locks and concurrency

OneFS has a fully distributed lock manager that coordinates locks on data across all nodes in a storage cluster. The lock manager is highly extensible and allows for multiple lock personalities to support both file system locks and cluster-coherent protocol-level locks such as SMB share mode locks or NFS advisory-mode locks. OneFS also has support for delegated locks such as CIFS oplocks and NFSv4 delegations. Every node in a cluster is a coordinator for locking resources, and a coordinator is assigned to lockable resources based upon an advanced hashing algorithm.
Efficient locking is critical to supporting the parallel I/O profile demanded by many iterative DL workloads, enabling concurrent file read access that scales into the millions.
For more details, see the OneFS Technical Overview.
NETWORKING: DELL EMC POWERSWITCH S5232F-ON SWITCH
The Dell EMC PowerSwitch S5200-ON 25/100 GbE fixed switches (Figure 5) comprise Dell EMC’s latest disaggregated hardware and software data center networking solutions, providing state-of-the-art, high-density 25/100 GbE ports and a broad range of functionality to meet the growing demands of today’s data center environment. These innovative, next-generation open networking switches offer optimum flexibility and cost-effectiveness for web 2.0, enterprise, mid-market, and cloud service provider environments with demanding compute and storage traffic. This series supports RDMA over Converged Ethernet (RoCE), which allows a GPU to communicate with a NIC directly across the PCIe bus, without involving the CPU. Both RoCE v1 and v2 are supported.
The PowerSwitch S5232F-ON is a 1 RU switch with 32 QSFP28 ports that can provide 40 GbE and 100 GbE.
Figure 5: Dell EMC PowerSwitch S5232F-ON, 32 QSFP28 ports that support 40/100 GbE
COMPUTE: NVIDIA DGX-2 SYSTEM
The DGX-2 system (Figure 6) is a fully integrated, turnkey hardware and software system that is purpose-built for DL workflows. Each DGX-2 system is powered by 16 NVIDIA Tesla V100 GPUs that are interconnected using NVIDIA NVSwitch technology, which provides an ultra-high-bandwidth, low-latency fabric for inter-GPU communication. This topology is essential for multi-GPU training, eliminating the bottleneck associated with PCIe-based interconnects that cannot deliver linearity of performance as GPU count increases. The DGX-2 system is also equipped with eight high-bandwidth, low-latency network interconnects for multi-node clustering over RDMA-capable fabrics, including RoCE and InfiniBand.
Figure 6: NVIDIA DGX-2 system with 16 Tesla V100 GPUs
NVIDIA GPU CLOUD
The NVIDIA GPU Cloud (NGC) container registry provides researchers, data scientists, and developers with simple access to a comprehensive catalog of GPU-accelerated software for AI, DL, machine learning (ML), and HPC that takes full advantage of NVIDIA DGX-2 systems. NGC provides containers for today’s most popular AI frameworks such as RAPIDS, Caffe2, TensorFlow, PyTorch, MXNet, and TensorRT, which are optimized for NVIDIA GPUs. The containers integrate the framework or application, necessary drivers, libraries, and communications primitives, and they are optimized across the stack by NVIDIA for maximum GPU-accelerated performance. NGC containers incorporate the NVIDIA CUDA® Toolkit, which provides the NVIDIA CUDA Basic Linear Algebra Subroutines Library (cuBLAS), the NVIDIA CUDA Deep Neural Network Library (cuDNN), and much more. The NGC containers also include the NVIDIA Collective Communications Library (NCCL) for multi-GPU and multi-node collective communication primitives, enabling topology awareness for DL training. NCCL enables communication between GPUs inside a single DGX-2 system and across multiple DGX-2 systems.

BILL OF MATERIALS

Component | Purpose | Quantity
Dell EMC Isilon F800 (96 TB SSD, 1 TB RAM, four 1 GbE and eight 40 GbE interfaces) | Shared storage | 2 (4U chassis, 8 nodes)
Celestica D4040 851-0259 | Isilon back-end Ethernet switch | 1
Dell EMC PowerSwitch S5232F-ON | Data switch | 2
NVIDIA DGX-2 system (16 Tesla V100-SXM3-32 GB GPUs, two 24-core Intel Xeon Platinum 8168 2.70 GHz, 1.5 TB RAM, one 2-port ConnectX-5 NIC for storage) | Compute server | 3
100 GbE optical cable, QSFP28 | DGX-2 system | 30
100 GbE cable, QSFP28 | Data switch inter-switch links | 2
Mellanox 2 m QSFP28 copper 40 G cable, MC2206130-002 | Isilon front-end | 16
EMC Amphenol 3.0 m 40 G cable, 038-002-066-02 1849, QSFP28 | Isilon back-end | 8

Table 1: Bill of materials
SOFTWARE VERSIONS
Table 2 shows the software versions that were tested for this document.

Component | Version
AI Benchmark Util | https://github.com/claudiofahey/ai-benchmark-util/commit/99e8167
Dell EMC Isilon – OneFS | 8.1.2.0; patches: 8.1.2.0_KGA-RUP_2019-09_256178, 8.1.2.0_UGA-PATCH-INFRA_2019-09_255624, 8.1.2.0_UGA-RUP_2019-09_256176
Dell EMC Networking OS10 Enterprise | 10.5.0.0.326
DGX-2 – Base OS | 4.2.0
DGX-2 – BIOS | 0.24
DGX-2 – Linux kernel | 4.15.0-65-generic
DGX-2 – NVIDIA Driver | 418.67
DGX-2 – Ubuntu | 18.04.3 LTS
NVIDIA GPU Cloud TensorFlow Image | nvcr.io/nvidia/tensorflow:19.09-py3
TensorFlow | 1.14.0
TensorFlow Benchmarks | https://github.com/claudiofahey/benchmarks/commit/31ea13f

Table 2: Software versions

Deep learning training performance and analysis

BENCHMARK METHODOLOGY
To measure the performance of the solution, various benchmarks from the TensorFlow Benchmarks repository
were executed. This suite of benchmarks performs training of an image classification convolutional neural network (CNN) on labeled images. Essentially, the system learns whether an image contains a cat, dog, car, train, etc. The well-known
ILSVRC2012 image dataset (often referred to as ImageNet) was used. This dataset contains 1,281,167 training images in
144.8 GB (all unit prefixes in this document use the SI standard, base 10, where 1 GB is 1 billion bytes). All images are grouped into 1000 categories or classes. This dataset is commonly used by DL researchers for benchmarking and comparison studies.
The individual JPEG images in the ImageNet dataset were converted to 1024 TFRecord files (see Appendix – Benchmark setup). The TFRecord file format is a Protocol Buffers binary format that combines multiple JPEG image files together with their metadata (bounding box for cropping and label) into one binary file. It maintains the image compression offered by the JPEG format and the total size of the dataset remained roughly the same (148 GB). The average image size was 115 KB.
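As a rough sketch of what one record in such a file looks like, the snippet below packs a JPEG file, a label, and a bounding box into a tf.train.Example and appends it to a TFRecord file. The feature names, file names, and values shown are illustrative assumptions, not necessarily the exact schema produced by the conversion scripts referenced in the appendix.

import tensorflow as tf

def make_example(jpeg_path, label, bbox):
    # The JPEG bytes are stored as-is, so the record keeps the JPEG compression.
    with open(jpeg_path, 'rb') as f:
        jpeg_bytes = f.read()
    xmin, ymin, xmax, ymax = bbox
    feature = {
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        'image/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=[xmin])),
        'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=[ymin])),
        'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=[xmax])),
        'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=[ymax])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Many images are appended to one TFRecord file (roughly 1,250 images per file here).
with tf.io.TFRecordWriter('train-00000-of-01024') as writer:
    example = make_example('n01440764_10026.JPEG', label=1, bbox=(0.1, 0.2, 0.9, 0.8))
    writer.write(example.SerializeToString())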
When running the benchmarks on the 148 GB dataset, it was found that the storage I/O throughput gradually decreased and became virtually zero after a few minutes. This indicated that the entire dataset was cached in the Linux buffer cache on each DGX-2 system. Of course, this is not surprising since each DGX-2 system has 1.5 TB of RAM and this workload did not significantly use RAM for other purposes. As real datasets are often significantly larger than this, we wanted to determine the performance with datasets that are not only larger than the DGX-2 system RAM, but larger than the 2 TB of
coherent shared cache available across the eight-node Isilon cluster. To accomplish this, we simply made 150 exact copies of each TFRecord file, creating a 22.2 TB dataset.
In our own testing, a parallel Python MPI script was created to quickly create the copies utilizing all DGX-2 systems and Isilon nodes. But to illustrate the copy process, it was basically as simple as this:
cp train-00000-of-01024 train-00000-of-01024-copy-000
cp train-00000-of-01024 train-00000-of-01024-copy-001
cp train-00000-of-01024 train-00000-of-01024-copy-002
cp train-00001-of-01024 train-00001-of-01024-copy-000
cp train-00001-of-01024 train-00001-of-01024-copy-001
cp train-00001-of-01024 train-00001-of-01024-copy-002
...
cp train-01023-of-01024 train-01023-of-01024-copy-002
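The parallel copy script itself is part of the AI Benchmark Util repository listed in Table 2; the sketch below only illustrates the general idea with mpi4py, where each MPI rank copies a disjoint slice of the (source file, copy index) pairs. The paths and mount point are assumptions.

from mpi4py import MPI
import glob
import shutil

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

src_files = sorted(glob.glob('/mnt/isilon/data/imagenet-scratch/tfrecords/train-*-of-01024'))
tasks = [(src, '%s-copy-%03d' % (src, i)) for src in src_files for i in range(150)]

# Each rank takes every size-th task, spreading the 150 x 1024 copies across
# all MPI processes launched on the DGX-2 systems.
for src, dst in tasks[rank::size]:
    shutil.copy(src, dst)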
Having 150 copies of the exact same images doesn’t improve training accuracy or speed but it does produce the same
I/O pattern for the storage, network, and GPUs. Having identical files did not provide an unfair advantage as Isilon deduplication was not enabled and all images are reordered randomly (shuffled) in the input pipeline.
For this workload, it is straightforward to quantify the effect of caching. Essentially there are many threads reading the TFRecord files. Each thread picks a TFRecord file at random, reads it sequentially and completely and then it moves on to another TFRecord file at random. Sometimes, a thread will choose to read a file that another thread has recently read and that can be served from cache (either Isilon or Linux). The probability of this occurring and therefore the fraction of data served by cache, is simply the cache size divided by the dataset size. Using this calculation, we expect a 7% cache hit rate from the Linux cache on the DGX-2 system and a 9% cache hit rate from Isilon. Conversely, we can say that 91% of the dataset is read from Isilon’s SSDs. Additionally, if the dataset were twice the size (44 TB), the Isilon SSD read rate would only increase by 4%.
One of the critical questions one has when trying to size a system is how fast the storage must be so that it is not a bottleneck. To answer this question, we take advantage of the fact that the Linux buffer cache can completely cache the entire 148 GB dataset. After performing several warm-up runs, the benchmark is executed, and it is confirmed that there is virtually zero NFS network I/O to the storage system. The image rate (images/sec) measured in this way accounts for the significant preprocessing pipeline as well as the GPU computation. To determine the throughput (bytes/sec) demanded by this workload, we simply multiply the images/sec by the average image size (115 KB). In the next section, results using this method are labeled Linux Cache.
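Both rules of thumb above reduce to simple arithmetic. The sketch below reproduces the expected cache-hit rates and the storage throughput implied by the Linux Cache image rate; the 39,337 images/s figure is the 48-GPU ResNet-50 result reported in the next section.

# All prefixes are SI (1 GB = 1e9 bytes), as elsewhere in this document.
dataset_tb      = 0.148 * 150     # 150 copies of the 148 GB dataset, about 22.2 TB
linux_cache_tb  = 1.5             # RAM in one DGX-2 system
isilon_cache_tb = 2.0             # coherent cache across the 8-node Isilon cluster

print('Linux cache hit rate:  %.0f%%' % (100 * linux_cache_tb / dataset_tb))    # ~7%
print('Isilon cache hit rate: %.0f%%' % (100 * isilon_cache_tb / dataset_tb))   # ~9%

images_per_sec  = 39337           # ResNet-50, 48 GPUs, Linux Cache
avg_image_bytes = 115e3           # average image size in the TFRecord files
print('Throughput demanded:   %.1f GB/s' % (images_per_sec * avg_image_bytes / 1e9))  # ~4.5 GB/s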
To avoid confusion, the performance results using the synthetic benchmark mode are not reported in this document. In the synthetic mode of the benchmark, random images are generated directly in the GPU and training is based on these images. This mode is very useful to tune and understand parts of the training pipeline. However, it does not perform storage I/O, JPEG decoding, image resizing, or any other preprocessing.
There are a variety of ways to parallelize model training to take advantage of multiple GPUs across multiple servers. In our tests, we used Horovod. Horovod is a distributed DL training framework for TensorFlow and other DL frameworks. Horovod uses Message Passing Interface (MPI) for high-speed and low-latency communication between processes.
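For readers unfamiliar with Horovod, the following is a minimal TensorFlow 1.x sketch of the pattern (it is not the tf_cnn_benchmarks code): mpirun launches one process per GPU, each process pins itself to its local GPU, and the optimizer is wrapped so gradients are averaged across all processes over MPI/NCCL.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                             # one process per GPU, launched by mpirun

# Pin this process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Trivial stand-in for a real model so the example is self-contained.
w = tf.Variable(5.0)
loss = tf.square(w - 2.0)

global_step = tf.train.get_or_create_global_step()
opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)                    # all-reduce gradients across processes
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),          # rank 0 broadcasts the initial weights
         tf.train.StopAtStepHook(last_step=1000)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)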
Prior to each execution of the benchmark, the L1 and L2 caches on Isilon were flushed with the command
isi_for_array isi_flush. In addition, the Linux buffer cache was flushed on all DGX-2 systems by running sync; echo 3 > /proc/sys/vm/drop_caches. However, note that the training process will read the same files repeatedly
and after just several minutes, much of the data will be served from one of these caches. The command below was used to perform the ResNet-50 training with 48 GPUs.
mpirun \
--np 48 \
-allow-run-as-root \
--host dgx2-1:16,dgx2-2:16,dgx2-3:16 \
--report-bindings \
-bind-to none \
-map-by slot \
-x LD_LIBRARY_PATH \
-x PATH \
-mca plm_rsh_agent ssh \
-mca plm_rsh_args "-p 2222" \
-mca pml ob1 \
-mca btl ^openib \
-mca btl_tcp_if_include enp53s0 \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_HCA=mlx5 \
-x NCCL_IB_SL=4 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_NET_GDR_READ=1 \
-x NCCL_SOCKET_IFNAME=^docker0,lo \
./round_robin_mpi.py \
python \
-u \
/mnt/isilon/data/tensorflow-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model=resnet50 \
--batch_size=256 \
--batch_group_size=20 \
--num_batches=1000 \
--nodistortions \
--num_gpus=1 \
--device=gpu \
--force_gpu_compatible=True \
--data_format=NCHW \
--use_fp16=True \
--use_tf_layers=True \
--data_name=imagenet \
--use_datasets=True \
--num_intra_threads=1 \
--num_inter_threads=40 \
--datasets_prefetch_buffer_size=40 \
--datasets_num_private_threads=4 \
--train_dir=/mnt/isilon/data/train_dir/2019-10-24-14-53-59-resnet50 \
--sync_on_finish=True \
--summary_verbosity=1 \
--save_summaries_steps=100 \
--save_model_secs=600 \
--variable_update=horovod \
--horovod_device=gpu \
--data_dir=/mnt/isilon1/data/imagenet-scratch/tfrecords-150x \
--data_dir=/mnt/isilon2/data/imagenet-scratch/tfrecords-150x \
...
--data_dir=/mnt/isilon16/data/imagenet-scratch/tfrecords-150x
The script round_robin_mpi.py was used to select a single --data_dir parameter that distributed the processes across 16 different mount points. Note that when testing the Linux Cache performance, only a single mount point was used.
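The round_robin_mpi.py script is part of the AI Benchmark Util repository; purely as a hypothetical sketch of the idea, a wrapper like the one below could read its MPI rank from the environment, keep one of the --data_dir arguments chosen round-robin by rank, and then replace itself with the real training command.

#!/usr/bin/env python
# Hypothetical sketch only; see the ai-benchmark-util repository for the real script.
import os
import sys

rank = int(os.environ.get('OMPI_COMM_WORLD_RANK', '0'))    # set by Open MPI's mpirun

args      = sys.argv[1:]                                    # python -u tf_cnn_benchmarks.py ... plus one --data_dir per mount
data_dirs = [a for a in args if a.startswith('--data_dir=')]
command   = [a for a in args if not a.startswith('--data_dir=')]

# Keep only the mount point assigned to this rank (round robin across the mounts).
command.append(data_dirs[rank % len(data_dirs)])
os.execvp(command[0], command)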
For different numbers of GPUs, only the --np parameter was changed. Note that the -map-by slot setting causes MPI to use all 16 GPUs (slots) on a DGX-2 system before it begins using the next DGX-2 system. For example, when testing 32 GPUs, only two DGX-2 systems are used.
For the other models, only the --model parameter was changed. However, for VGG-16, the batch size was set to 192. The benchmark results in this section were obtained with eight Isilon F800 nodes in the cluster. Each result is the average
of three executions.

BENCHMARK RESULTS
There are a few conclusions that we can make from the benchmarks represented in Figure 7.
• Image throughput and therefore storage throughput scale linearly from 16 to 48 GPUs.
• There is no significant difference in image throughput between Linux Cache and Isilon.
Figure 7: Model Development – Training Benchmark Results

Images/sec by model and GPU count (each result is the average of three runs):

Model | GPUs | Isilon | Linux Cache
ResNet-50 | 16 | 13,287 | 13,372
ResNet-50 | 32 | 26,274 | 26,305
ResNet-50 | 48 | 39,230 | 39,337
VGG-16 | 16 | 8,112 | 8,112
VGG-16 | 32 | 16,079 | 16,050
VGG-16 | 48 | 23,687 | 23,679
Inception-V3 | 16 | 8,375 | 8,408
Inception-V3 | 32 | 16,457 | 16,500
Inception-V3 | 48 | 24,351 | 24,413
SYSTEM METRICS
System metrics were collected and analyzed for all tests shown in Figure 7. In Figure 8, only metrics captured during
three runs of ResNet-50 training on 48 GPUs are shown. There are a few conclusions that we can make from the GPU and CPU metrics as represented in Figure 8.
• Each GPU had 97% utilization or higher. This indicates that the GPUs were fully utilized.
• The maximum CPU core utilization on the DGX-2 system was 70%. This occurred with ResNet-50.