Dell R740xd User Manual

Deployment guide
VMware vSphere Bitfusion on Dell EMC PowerEdge servers
Abstract
March 2021
Revisions
Date | Description
September 2020 | Initial release
March 2021 | Document updated with support information for VMware vSphere Bitfusion 2.5
Acknowledgements
Authors: Jay Engh, Chris Gully
Support: Gurupreet Kaushik, Sherry Keller
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
Copyright © 03/05/2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
Table of contents
Revisions
Acknowledgements
Table of contents
Executive summary
1 New and enhanced in VMware vSphere Bitfusion 2.5
2 Audience and scope
3 Overview
4 Component overview
4.1 Dell EMC PowerEdge R740xd server
4.2 Dell EMC PowerEdge C4140 server
4.3 Dell EMC vSAN Ready Node R740xd
4.4 Dell EMC Networking S5248F-ON Switch
4.5 NVIDIA T4 Datacenter GPU
4.6 NVIDIA V100 for NVLINK and PCIe, Datacenter GPU
4.7 Mellanox ConnectX-5 Dual Port 10/25GbE Adapter
5 Pre-deployment requirements and introduction to new features
5.1 GPU hosts
5.2 Client cluster
5.3 Introduction to new features
5.3.1 Remote clients
5.3.2 Bare-metal server clients
5.3.3 Improved health checks
5.4 Bitfusion server and client software
5.5 vCenter
5.6 Client virtual machine
5.7 Connectivity
5.8 Network services
6 Solution overview
6.1 Architecture
6.2 Component information
6.3 VLANs and IP subnet information
7 Deployment and configuration
7.1 Verify the GPU host hardware configuration
7.2 Add the GPU hosts and client cluster to the vCenter inventory
7.3 Prepare and configure the GPU hosts and client cluster for PVRDMA
7.3.1 Tag a VMkernel adapter for PVRDMA
7.3.2 Add and manage hosts on the Bitfusion distributed switch
7.4 Deploy the Open Virtual Appliance (OVA) to create the bitfusion-server-1 virtual machine
7.5 Edit the bitfusion-server-1 hardware settings
7.6 Deploy the Open Virtual Appliance (OVA) to create the bitfusion-server-2 virtual machine
7.7 Edit the bitfusion-server-2 virtual machine hardware settings
7.8 Deploy the OVA to create the bitfusion-server-3 virtual machine
7.9 Edit the bitfusion-server-3 virtual machine hardware settings
7.10 Provide client cluster access to GPU resources
7.11 Support for remote clients and bare-metal server
7.12 Support for backup and restore
8 Getting help
8.1 Contacting Dell EMC
8.2 Documentation resources
8.3 VMware Hands-On-Labs (HOLs)
8.4 Documentation feedback
Executive summary
Computational requirements have evolved and diverged with the wide adoption of containers and virtualization. Today’s workloads challenge the CPU-centric paradigm that has long determined customers' server performance and utility needs. General-purpose CPUs alone no longer adequately handle these new workloads, whether they are artificial intelligence, machine learning, or virtual desktops. However, CPU power can be supplemented with add-in graphics processing units (GPUs), which provide thousands of computing cores instead of the tens of cores available in a general-purpose CPU. Because putting dedicated GPU hardware into every server is uneconomical, VMware vSphere Bitfusion on PowerEdge servers delivers these pooled resources over the network to any properly configured client node. With Bitfusion built into the familiar vSphere infrastructure, administrators can increase the utilization of these expensive resources. This delivers value to networked nodes that require powerful GPUs to execute massively parallel workloads.
1 New and enhanced in VMware vSphere Bitfusion 2.5
The following are some of the important features included with the release of VMware vSphere Bitfusion 2.5:
Support for bare-metal server clients
Client support for virtual machines or servers running in an adjacent VMware vCenter cluster
Improved health checks
Introduction of new scripts that gather all server logs to send information to technical support
Upgrade to the existing backup and restore process to version 2.5
Support for earlier versions of VMware vSphere Bitfusion client software
The new features are described in the following sections of the deployment guide:
Remote clients
Bare-metal server clients
Improved health checks
Support for remote clients and bare-metal server
Support for backup and restore
2 Audience and scope
This deployment guide includes step-by-step instructions for deployment and configuration of the VMware vSphere Bitfusion appliance on Dell EMC PowerEdge R740xd and C4140 rack servers.
This deployment guide makes certain assumptions about the prerequisite knowledge of the deployment personnel and the hardware they are using. This includes:
Use of Dell EMC servers and switches including the location of buttons, cables, and components in the hardware
Functional knowledge of the items in the Dell EMC owner's manuals for the products being used
Use of VMware products and the components or features of VMware vSphere
Data center infrastructure best practices in the areas of server, storage, networking, and environmental considerations such as power and cooling
Familiarity with CentOS installation, configuration, and package management
Familiarity with the NVIDIA CUDA Toolkit
The scope of this document excludes existing infrastructure components outside of the specific hardware and software mentioned in this guide. VMware vSphere Bitfusion support is not limited to the hardware models, configuration values, and software component versions used in this document. Dell EMC takes no responsibility for any issues that may be caused to existing infrastructure during deployment.
3 Overview
With VMware vSphere Bitfusion software, graphics processing units (GPUs) are no longer isolated from other resources. GPUs are shared in a virtualized pool of resources, and you can access them from any virtual machine in the infrastructure, as shown in Figure 1. As with processor and storage resources, GPU deployments can now benefit from optimized utilization, reduced CapEx and OpEx, and accelerated development and deployment of R&D resources. Data scientists and AI developers can benefit from how Bitfusion supports monitoring higher workloads.
Bitfusion offers the following key features:
Dynamic GPU attach anywhere: Bitfusion disaggregates your GPU compute and dynamically attaches GPUs anywhere in the datacenter, just like attaching storage.
Fractional GPUs for efficiency: Bitfusion enables use of any arbitrary fractions of GPUs. Support more users in the test and development phase.
Standards-based accelerator access: Leverage GPUs across an infrastructure and integrate evolving technologies as standards emerge.
Application run-time virtualization: Bitfusion attaches GPUs based on CUDA calls at run time, maximizing utilization of GPU servers anywhere in the network.
Any application: Bitfusion is a transparent layer and runs with any workload in a TensorFlow or PyTorch ecosystem.
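The fractional GPU feature can be illustrated with simple arithmetic. The following sketch assumes that a fractional allocation divides GPU memory proportionally, and it uses the 16 GB memory size of the NVIDIA T4 given in section 4.5. The function names are hypothetical and are not part of the Bitfusion software.

```python
def memory_per_client_gb(gpu_memory_gb: float, share_percent: int) -> float:
    """Return the GPU memory, in GB, behind a share of `share_percent` percent of one GPU."""
    if not 0 < share_percent <= 100:
        raise ValueError("share_percent must be between 1 and 100")
    return gpu_memory_gb * share_percent / 100


def max_clients_per_gpu(share_percent: int) -> int:
    """Return how many equal shares of `share_percent` percent fit on one GPU."""
    if not 0 < share_percent <= 100:
        raise ValueError("share_percent must be between 1 and 100")
    return 100 // share_percent


if __name__ == "__main__":
    t4_memory_gb = 16  # NVIDIA T4 memory size given in section 4.5
    for share in (100, 50, 25):
        print(
            f"{share}% share: {memory_per_client_gb(t4_memory_gb, share)} GB per client, "
            f"{max_clients_per_gpu(share)} client(s) per GPU"
        )
```

Under this assumption, smaller shares let more test and development users work on the same physical GPU, at the cost of less memory per user.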
Figure 1: Bitfusion GPU sharing model
Deployment of VMware Bitfusion on Dell EMC PowerEdge servers provides an infrastructure solution that combines best-in-class hardware from Dell EMC with core VMware products. Virtualization of computation, storage, networking, and accelerators is delivered on a cluster of PowerEdge servers. The combination of VMware vSphere Bitfusion software and the Dell EMC PowerEdge hardware described in this document has been validated in Dell EMC labs.
4 Component overview
This section briefly describes the components that support VMware vSphere Bitfusion and their key capabilities to help you deploy the software.
4.1 Dell EMC PowerEdge R740xd server
The PowerEdge R740xd server provides the benefit of scalable storage performance and data set processing. This 2U, 2-socket platform brings you scalability and performance to adapt to a variety of applications. This platform can be configured with up to 3x NVIDIA V100 GPUs or 6x NVIDIA T4 GPUs, and also offers the flexibility to support additional configurations such as 24x 2.5” NVMe drives and two NVIDIA GPUs. As you scale your deployments, scale your productivity with embedded intelligence and automation from iDRAC9 and the entire OpenManage portfolio, which is designed to simplify the IT lifecycle from deployment to retirement.
Key capabilities:
24 DIMM slots of DDR4 memory (RDIMM or LRDIMM)
Up to 24 SAS or SATA SSDs or hard drives, and NVMe PCIe SSDs
Boot device options such as BOSS
Double-wide GPUs, up to 300 W each, or single-wide GPUs, up to 150 W each
Front view of a Dell EMC PowerEdge R740xd
Rear view of a Dell EMC PowerEdge R740xd
4.2 Dell EMC PowerEdge C4140 server
PowerEdge C4140 is an incredibly dense purpose-built rack server designed to handle the most demanding technical computing workloads. With the 2nd Generation Intel® Xeon® Scalable processors and NVIDIA® Volta® technologies, the C4140 fills a key gap as a leading GPU-accelerated platform in the PowerEdge server portfolio to enable a scalable business architecture in a heterogeneous data center environment. With four double-width accelerators in just 1U of space, the C4140 delivers outstanding performance and maximum density while reducing your space, cost and management requirements.
Key capabilities:
Unthrottled performance and superior thermal efficiency with patent-pending interleaved GPU system design*
No-compromise (CPU + GPU) acceleration technology up to 500 TFLOPS/U+ using the NVIDIA® Tesla™ V100 with NVLink™
2.4 kW PSUs help future-proof for next-generation GPUs
Front view of a Dell EMC PowerEdge C4140
Rear view of a Dell EMC PowerEdge C4140
Internal view of a Dell EMC PowerEdge R740xd displaying an NVIDIA T4 GPU card
4.3 Dell EMC vSAN Ready Node R740xd
Dell EMC vSAN Ready Nodes are pre-configured building blocks that reduce deployment risks with certified configurations, improve storage efficiency by up to 50%, and can help you build or scale your vSAN cluster faster. Whether you are just getting started or expanding your existing VMware environment, Dell EMC is here for you every step of the way with consulting, education, deployment, and support services for the entire solution.
The Dell EMC vSAN Ready Node R740xd is a two-socket, 2U rack server designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The vSAN Ready Node R740xd is available in All-Flash and Hybrid configurations and features the latest generation of the Intel® Xeon® Scalable processor family. As with all vSAN Ready Nodes, the approved configurations can be accessed in the VMware Compatibility Guide. The vSAN Ready Node R740xd adds extraordinary storage capacity options, making it well-suited for data-intensive applications that require greater storage without sacrificing I/O performance. Because these are ready nodes, they take the guesswork and hassle out of procurement, deployment, and management.
4.4 Dell EMC Networking S5248F-ON Switch
The S5200F-ON series introduces optimized 25GbE and 100GbE open networking connectivity for servers and storage in demanding web and cloud environments. This next-generation family of top-of-rack 25GbE switches provides optimized performance both in-rack and between racks, a cost-effective 50/100GbE leaf/spine fabric, and migration capabilities for future connectivity needs.
Key capabilities:
48 x 10/25GbE SFP28 auto-negotiating ports
4 x 100GbE QSFP28 ports
2 x 2 x 100GbE QSFP28-DD ports
Front view of Dell PowerSwitch S5248F-ON
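The port counts listed in the key capabilities can be summed to estimate the total front-panel capacity of the switch. The following sketch is illustrative arithmetic only: it assumes that every port runs at its maximum listed speed and that each double-density port carries 2 x 100GbE. The dictionary keys and the function name are hypothetical.

```python
# Port groups as listed in the key capabilities: (number of ports, Gb/s per port).
PORT_GROUPS = {
    "25GbE SFP28": (48, 25),
    "100GbE QSFP28": (4, 100),
    "2 x 100GbE double-density QSFP": (2, 200),  # assumption: each port carries 2 x 100GbE
}


def total_port_capacity_gbps(port_groups: dict) -> int:
    """Sum the one-way line rate of all ports, in Gb/s."""
    return sum(count * speed for count, speed in port_groups.values())


if __name__ == "__main__":
    print(f"Total port capacity: {total_port_capacity_gbps(PORT_GROUPS)} Gb/s")
```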
4.5 NVIDIA T4 Datacenter GPU
The NVIDIA® T4 is a single-slot, low-profile, 6.6-inch PCI Express Gen3 Universal Deep Learning Accelerator based on the TU104 NVIDIA graphics processing unit (GPU). The T4 has 16 GB GDDR6 memory and a 70 W maximum power limit. The T4 is offered as a passively cooled board that requires system air flow to operate the card within its thermal limits.
NVIDIA T4 GPU
4.6 NVIDIA V100 for NVLINK and PCIe, Datacenter GPU
The NVIDIA® V100 Tensor Core GPU, powered by the NVIDIA Volta architecture, is the most widely used accelerator for scientific computing and artificial intelligence, and the most advanced data center GPU built to accelerate AI and data science. It comes in 16 GB and 32 GB configurations and offers the performance of up to 32 CPUs in a single GPU. Deep learning training workloads can leverage the NVLink capability of the V100 SXM2 GPUs on the C4140. NVLink enables direct communication between GPUs with a bandwidth of up to 300 GB/s, further increasing the performance of AI training workloads.
NVIDIA V100 for PCIe
NVIDIA V100 for NVLink
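To put the NVLink bandwidth figure in context, the following sketch estimates the ideal time needed to move the contents of a 32 GB V100 over a 300 GB/s NVLink connection and over a single 25GbE port such as those described in section 4.4. It is a rough calculation that assumes peak bandwidth and ignores protocol overhead and latency; the function names are hypothetical.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a line rate in Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


def transfer_time_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return the ideal transfer time for `size_gb` at the given bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth_gb_per_s must be positive")
    return size_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    gpu_memory_gb = 32  # larger V100 memory configuration
    nvlink = transfer_time_seconds(gpu_memory_gb, 300)
    ethernet = transfer_time_seconds(gpu_memory_gb, gbit_to_gbyte_per_s(25))
    print(f"NVLink (300 GB/s): {nvlink:.3f} s")
    print(f"One 25GbE port:    {ethernet:.2f} s")
```

The gap between the two results is one reason why GPU-to-GPU traffic in multi-GPU training benefits from staying on NVLink within the host.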
4.7 Mellanox ConnectX-5 Dual Port 10/25GbE Adapter
ConnectX-5 EN supports two ports of 25Gb Ethernet connectivity, sub-600 ns latency, and very high message rate, plus PCIe switch and NVMe over Fabric offloads, providing the highest performance and most flexible solution for the most demanding applications and markets: Machine Learning, Data Analytics, and more.
Key capabilities:
Up to 25 Gb/s connectivity per port
Industry-leading throughput, low latency, low CPU utilization and high message rate
RoCE for Overlay Networks
Mellanox ConnectX-5 Dual Port 10/25GbE Adapter