The NVIDIA® DGX-2™ System is the world’s first two-petaFLOPS system that engages
16 fully interconnected GPUs for accelerated deep learning performance. The DGX-2
System is powered by the NVIDIA® DGX™ software stack and an architecture designed for
Deep Learning, High-Performance Computing, and analytics.
ABOUT THIS DOCUMENT
This document is for users and administrators of the DGX-2 System. It is organized as
follows:
Chapters 1-4: Overview of the DGX-2 System, including basic first-time setup and operation.
Chapters 5-6: Network and storage configuration instructions.
Chapters 7-8: Software and firmware update instructions.
Chapter 9: How to use the BMC.
Chapter 10: How to configure and use the DGX-2 System as a Kernel Virtual Machine host.
HARDWARE OVERVIEW
1.2.1 Major Components
The following diagram shows the major components of the DGX-2 System.
Left side port designation: enp134s0f0
Right side port designation: enp134s0f1

ID  Qty  Description
3   1    RJ45 network port (for in-band management)
4   2    USB 3.0 ports
5   1    IPMI port (for out-of-band management (BMC))
6   1    VGA port
7   1    Serial port (DB-9)
8   1    System ID LED
         Blinks blue when the ID button is pressed from the front of the unit as an aid in identifying the unit needing servicing
9   1    BMC reset button
10  1    Power and BMC heartbeat LED
         On/Off – BMC is not ready
         Blinking – BMC is ready
NETWORK PORTS
The following figure highlights the available network ports and their purpose.
ID 1: BMC (remote management and monitoring)
    Uses: Out-of-band management
    Number of ports: 1
    Port type: RJ45 (100/1000 Ethernet)
    Cable type: Cat5E/6 Ethernet

ID 2: Motherboard RJ45
    Uses: In-band management, administration
    Number of ports: 1
    Port type: RJ45 (100/1000 Ethernet)
    Cable type: Cat5E/6 Ethernet

ID 3: ConnectX-5 (LP), Ethernet mode
    Uses: Storage (NFS), system communication
    Number of ports: 2 (Left: enp134s0f0, Right: enp134s0f1)
    Port type: QSFP28
    Cable type: Ethernet 100 GbE (QSFP28); 10/25/40 GbE (QSFP28 to SFP28 or SFP+)

ID 4: ConnectX-5, InfiniBand mode or Ethernet mode
    Uses: Clustering, storage
    Number of ports: 8
    Port type: QSFP28
    Cable type: InfiniBand EDR or Ethernet 100GbE
RECOMMENDED PORTS TO USE FOR EXTERNAL
STORAGE
For clarity, the following figure reiterates the recommended ports to use for external
storage. In most configurations, the storage ports (ID 1 below) should be used for
connecting to high-speed NAS storage, while the cluster ports (ID 2 below) should be
used for communication between nodes.
ID 1: ConnectX-5 (LP)
    Uses: Storage (NFS)
    Number of ports: 2 (Left: enp134s0f0, Right: enp134s0f1)
    Port type: QSFP28
    Cable type: 1/10/25/40/100 GbE

ID 2: ConnectX-5, InfiniBand mode or Ethernet mode
    Uses: Cluster
    Number of ports: 8
    Port type: QSFP28
    Cable type: EDR InfiniBand or 100 GbE
DGX OS SOFTWARE
The DGX-2 System comes installed with a base OS incorporating
An Ubuntu server distribution with supporting packages
The NVIDIA driver
Docker CE
NVIDIA Container Runtime for Docker
The following health monitoring software
● NVIDIA System Management (NVSM)
Provides active health monitoring and system alerts for NVIDIA DGX nodes in a
data center. It also provides simple commands for checking the health of the
DGX-2 System from the command line.
● Data Center GPU Management (DCGM)
This software enables node-wide administration of GPUs, and can be used for
cluster and data-center level management.
ADDITIONAL DOCUMENTATION
Note: Some of the documentation listed below is not available at the time of
publication. See https://docs.nvidia.com/dgx/ for the latest status.
DGX-2 System Service Manual
Instructions for servicing the DGX-2 System, including how to replace select
components.
DGX OS Server Release Notes
Provides software component versions as well as a list of changes and known issues
in the installed OS software.
NGC Container Registry for DGX
How to access the NGC container registry for using containerized deep learning
GPU-accelerated applications on your DGX-2 System.
NVSM Software User Guide
Contains instructions for using the NVIDIA System Management software.
DCGM Software User Guide
Contains instructions for using the Data Center GPU Management software.
CUSTOMER SUPPORT
Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or
diagnosing problems with your DGX-2 System. Also contact NVIDIA Enterprise
Support for assistance in installing or moving the DGX-2 System. You can contact
NVIDIA Enterprise Support in the following ways.
1.7.1 NVIDIA Enterprise Support Portal
The best way to file an incident is to log on to the NVIDIA Enterprise Support portal.
1.7.2 NVIDIA Enterprise Support Email
You can also send an email to enterprisesupport@nvidia.com.
1.7.3 NVIDIA Enterprise Support - Local Time Zone Phone Numbers
Visit the NVIDIA Enterprise Support page.
CONNECTING TO THE DGX-2 CONSOLE
Connect to the DGX-2 console using a direct connection, a remote connection
through the BMC, or an SSH connection.
CAUTION: Connect directly to the DGX-2 console if the DGX-2 System is connected
to a 172.17.xx.xx subnet.
DGX OS Server software installs Docker CE which uses the 172.17.xx.xx subnet by
default for Docker containers. If the DGX-2 System is on the same subnet, you will not
be able to establish a network connection to the DGX-2 System.
Refer to the section Configuring Docker IP Addresses for instructions on how to change
the default Docker network settings.
DIRECT CONNECTION
At either the front or the back of the DGX-2 System, connect a display to the VGA
connector, and a keyboard to any of the USB ports.
DGX-2 Server Front
DGX-2 Server Back
REMOTE CONNECTION THROUGH THE BMC
See the section Configuring Static IP Address for the BMC if you need to configure a
static IP address for the BMC.
This method requires that you have the BMC login credentials. These credentials
depend on the following conditions:
Prior to first time boot: The default credentials are
Username: admin
Password: admin
After first boot setup: The username that was set up for the administrator account
during the initial boot is used for both the BMC username and the BMC password.
Username: <administrator-username>
Password: <administrator-username>
After first boot setup with changed password: The BMC password can be changed
from “<administrator-username>”, in which case the credentials are
Username: <administrator-username>
Password: <new-bmc-password>
1. Make sure you have connected the BMC port on the DGX-2 System to your LAN.
2. Open a browser within your LAN and go to:
https://<ipmi-ip-address>/
Make sure popups are allowed for the BMC address.
3. Log in.
4. From the left-side navigation menu, click Remote Control.
The Remote Control page allows you to open a virtual Keyboard/Video/Mouse
(KVM) on the DGX-2 System, as if you were using a physical monitor and keyboard
connected to the front of the system.
5. Click Launch KVM.
The DGX-2 console appears in your browser.
SSH CONNECTION
You can also establish an SSH connection to the DGX-2 System through the network
port. See the section Network Ports to identify the port to use, and the section
Configuring Static IP Addresses for the Network Ports if you need to configure a static
IP address.
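For example, a minimal connection from a client on the same LAN might look like the
following (the hostname or IP address and the administrator account created during
first-boot setup are placeholders, not values from this guide):
$ ssh <administrator-username>@<dgx2-ip-address>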
SETTING UP THE DGX-2 SYSTEM
While NVIDIA service personnel will install the DGX-2 System at the site and perform
the first boot setup, the first boot setup instructions are provided here for reference and
to support any re-imaging of the server.
These instructions describe the setup process that occurs the first time the DGX-2 System
is powered on after delivery or after the server is re-imaged.
Be prepared to accept all End User License Agreements (EULAs) and to set up your
username and password.
1. Connect to the DGX-2 console as explained in Connecting to the DGX-2 Console.
2. Power on the DGX-2 System.
● Using the physical power button
● Using the Remote BMC
The system will take a few minutes to boot.
You are presented with end user license agreements (EULAs) for the NVIDIA
software.
3. Accept all EULAs to proceed with the installation.
The system boots and you are prompted to configure the DGX-2 software.
4. Perform the steps to configure the DGX-2 software.
● Select your language and location.
● Create a user account with your name, username, and password.
You will need these credentials to log in to the DGX-2 System as well as to log in to
the BMC remotely. When logging in to the BMC, enter your username for both the
User ID and the password. Be sure to create a unique BMC password at the
first opportunity.
CAUTION: Once you create your login credentials, the default admin/admin login will
no longer work.
Note: The BMC software will not accept "sysadmin" for a user name. If you create this
user name for the system log in, "sysadmin" will not be available for logging in to the
BMC.
● Choose a primary network interface for the DGX-2 System; for example, enp6s0.
This should typically be the interface that you will use for subsequent system
configuration or in-band management.
Note: After you select the primary network interface, the system attempts to configure
the interface for DHCP and then asks you to enter a hostname for the system. If DHCP
is not available, you will have the option to configure the network manually. If you
need to configure a static IP address on a network interface connected to a DHCP
network, select Cancel at the Network configuration – Please enter the
hostname for the system screen. The system will then present a screen with the
option to configure the network manually.
● Choose a host name for the DGX-2 System.
After completing the setup process, the DGX-2 System reboots automatically and
then presents the login prompt.
5. Update the software to ensure you are running the latest version.
Updating the software ensures your DGX-2 System contains important updates,
including security updates. The Ubuntu Security Notice site (https://usn.ubuntu.com/)
lists known Common Vulnerabilities and Exposures (CVEs), including those that can be
resolved by updating the DGX OS software.
a) Run the package manager.
$ sudo apt update
b) Upgrade to the latest version.
$ sudo apt full-upgrade
Note: RAID 1 Rebuild in Progress - When the system is booted after restoring the
image, software RAID begins the process of rebuilding the RAID 1 array - creating a
mirror of (or resynchronizing) the drive containing the software. System performance
may be affected during the RAID 1 rebuild process, which can take an hour to
complete.
During this time, the command “nvsm show health” will report a warning that the RAID
volume is resyncing.
You can check the status of the RAID 1 rebuild process using “sudo mdadm -D
/dev/md0”.
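As a quick illustration (assuming the RAID 1 volume is /dev/md0 as above; the device
name may differ on a re-imaged system), the rebuild progress can also be watched with
the standard Linux md tools:
$ sudo mdadm -D /dev/md0 | grep -i -E 'state|rebuild'
$ cat /proc/mdstat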
QUICK START INSTRUCTIONS
This chapter provides basic requirements and instructions for using the DGX-2 System,
including how to perform a preliminary health check and how to prepare for running
containers. Be sure to visit the DGX documentation website at
https://docs.nvidia.com/dgx/
for additional product documentation.
REGISTRATION
Be sure to register your DGX-2 System with NVIDIA as soon as you receive your
purchase confirmation e-mail. Registration enables your hardware warranty and allows
you to set up an NVIDIA GPU Cloud for DGX account.
To register your DGX-2 System, you will need information provided in your purchase
confirmation e-mail. If you do not have the information, send an e-mail to NVIDIA
Enterprise Support at enterprisesupport@nvidia.com.
1. From a browser, go to the NVIDIA DGX Product Registration page.
2. Enter all required information and then click SUBMIT to complete the registration
process and receive all warranty entitlements and DGX-2 support services
entitlements.
INSTALLATION AND CONFIGURATION
Your DGX-2 System will be installed by NVIDIA service personnel or an authorized
installation partner.
Before installation, make sure you have completed the Site Survey and have given all
relevant site information to your Installation Partner.
OBTAINING AN NVIDIA GPU CLOUD ACCOUNT
NVIDIA GPU Cloud (NGC) provides simple access to GPU-optimized software tools for
deep learning and high-performance computing (HPC) that take full advantage of
NVIDIA GPUs. An NGC account grants you access to these tools as well as the ability to
set up a private registry to manage your customized tools.
Work with NVIDIA Enterprise Support to set up an NGC enterprise account if you are
the organization administrator for your DGX-2 purchase. See the NGC Container
Registry for DGX User Guide for detailed instructions on getting an NGC enterprise account.
Before using the DGX-2 System to run containers from the NGC container registry, you
must visit the NGC web site to obtain your NGC API Key and to determine which
containers are available to run.
4.4.1 Getting Your NGC API Key
Your NGC API Key authenticates your access to the NGC container registry with its
NVIDIA tuned, tested, certified, and maintained containers for the top deep learning
frameworks.
You only need to generate an API Key once. Should you lose your API Key, you can
generate a new one from the NGC website. When you generate a new API Key, the old
one is invalidated.
Perform the following instructions from any system with internet access and a browser.
1. Log in to the NGC website (https://ngc.nvidia.com).
2. Click Get API Key from the Registry page.
3. Click Generate API Key from the Configuration->API Key page.
4. Click Confirm at the Generate a New API Key dialog.
Your NGC API Key is displayed at the bottom of the Configuration->API Key page
with examples of how to use it.
NGC does not save your key, so store it in a secure place. You can copy your API Key
to the clipboard by clicking the copy icon to the right of the API key.
4.4.2 Selecting CUDA Container Tags for Verification Examples
While you are logged in to the web site, select a CUDA container tag to use for the
verification procedure in the next section.
1. Select Registry from the left side menu.
2. Select a CUDA container tag.
a) Click the cuda repository (under the nvidia registry space).
b) In the Tag section, scroll down to find the latest ‘-runtime’ version. For example,
‘10.0-runtime’.
Note this tag as you will need to specify it when running the CUDA container in the
next section.
VERIFYING BASIC FUNCTIONALITY
This section walks you through the steps of performing a health check on the DGX-2
System, and verifying the Docker and NVIDIA driver installation.
1. Establish an SSH connection to the DGX-2 System.
2. Run a basic system check.
sudo nvsm show health
Verify that the output summary shows that all checks are Healthy and that the overall
system status is Healthy.
3. Verify that Docker is installed by viewing the installed Docker version.
sudo docker --version
This should return the version as “Docker version 18.03-ce”, where the actual
version may differ depending on the specific release of the DGX OS Server software.
4. Verify connection to the NVIDIA repository and that the NVIDIA Driver is installed.
sudo docker container run --runtime=nvidia --rm
nvcr.io/nvidia/cuda:<cuda-tag-obtained-from-previous-section> nvidia-smi
Docker pulls the nvidia/cuda container image layer by layer, then runs nvidia-smi.
When completed, the output should show the NVIDIA Driver version and a
description of each installed GPU.
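For example, using the ‘10.0-runtime’ tag noted in the previous section (substitute
whichever tag you selected; the tag shown here is only illustrative):
$ sudo docker container run --runtime=nvidia --rm nvcr.io/nvidia/cuda:10.0-runtime nvidia-smi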
See the NVIDIA Containers and Deep Learning Frameworks User Guide for further
instructions, including an example of logging into the NGC container registry and
launching a deep learning container.
NETWORK CONFIGURATION
This chapter describes key network considerations and instructions for the DGX-2
System.
BMC SECURITY
NVIDIA recommends that customers follow best security practices for BMC
management (IPMI port). These include, but are not limited to, such measures as:
Restricting the DGX-2 IPMI port to an isolated, dedicated, management network
Using a separate, firewalled subnet
Configuring a separate VLAN for BMC traffic if a dedicated network is not available
CONFIGURING NETWORK PROXIES
If your network requires use of a proxy server, you will need to set up configuration
files to ensure the DGX-2 System communicates through the proxy.
5.2.1 For the OS and Most Applications
Edit the file /etc/environment and add the following proxy addresses to the file,
below the PATH line.
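The exact values depend on your site's proxy server; the lines below are only an
illustrative sketch (the proxy host, port, and credentials shown are placeholders, not
values from this guide):
http_proxy="http://<username>:<password>@<proxy-host>:<proxy-port>/"
https_proxy="http://<username>:<password>@<proxy-host>:<proxy-port>/"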
5.2.2 For Docker
To ensure that Docker can access the NGC container registry through a proxy, configure
the proxy environment variables that Docker uses. For best practice recommendations
on configuring proxy environment variables for Docker,
see https://docs.docker.com/engine/admin/systemd/#http-proxy.
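As a sketch of the approach described at that link (the file name, proxy host, and port
are placeholders; adapt them to your environment), create a systemd drop-in for the
Docker service and then reload and restart Docker:
$ sudo mkdir -p /etc/systemd/system/docker.service.d
$ sudo vi /etc/systemd/system/docker.service.d/http-proxy.conf

[Service]
Environment="HTTP_PROXY=http://<proxy-host>:<proxy-port>/"
Environment="HTTPS_PROXY=http://<proxy-host>:<proxy-port>/"

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker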
CONFIGURING DOCKER IP ADDRESSES
To ensure that the DGX-2 System can access the network interfaces for Docker
containers, Docker should be configured to use a subnet distinct from other network
resources used by the DGX-2 System.
By default, Docker uses the 172.17.0.0/16 subnet. Consult your network administrator
to find out which IP addresses are used by your network.
If your network does not
conflict with the default Docker IP address range, then no changes are needed and
you can skip this section.
However, if your network uses the addresses within this range for the DGX-2 System,
you should change the default Docker network addresses.
You can change the default Docker network addresses by modifying either the
/etc/docker/daemon.json file or the
/etc/systemd/system/docker.service.d/docker-override.conf file. These instructions
provide an example of modifying the
/etc/systemd/system/docker.service.d/docker-override.conf file to override the
default Docker network addresses.
1. Open the docker-override.conf file for editing.
$ sudo vi /etc/systemd/system/docker.service.d/docker-override.conf
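As an illustrative sketch of what the override might contain (the subnet values below
are placeholders chosen so they do not collide with 172.17.0.0/16; use addresses
appropriate for your network):

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --bip=192.168.127.1/24 --fixed-cidr=192.168.127.128/25

After saving the file, reload systemd and restart Docker so the new addresses take effect:
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker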