Dell R740xd User Manual

Deployment guide

VMware vSphere Bitfusion on Dell EMC PowerEdge servers

Abstract

VMware vSphere Bitfusion is a software solution that you can deploy on Dell EMC PowerEdge R740xd and C4140 servers. The solution virtualizes hardware resources to provide a pool of shared resources that are accessible to any virtual machine in the network.

March 2021


Revisions

Date              Description
September 2020    Initial release
March 2021        Document updated with support information for VMware vSphere Bitfusion 2.5

Acknowledgements

Authors: Jay Engh, Chris Gully

Support: Gurupreet Kaushik, Sherry Keller

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 03/05/2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.


Table of contents

Revisions
Acknowledgements
Table of contents
Executive summary
1 New and enhanced in VMware vSphere Bitfusion 2.5
2 Audience and scope
3 Overview
4 Component overview
    4.1 Dell EMC PowerEdge R740xd server
    4.2 Dell EMC PowerEdge C4140 server
    4.3 Dell EMC vSAN Ready Node R740xd
    4.4 Dell EMC Networking S5248F-ON Switch
    4.5 NVIDIA T4 Datacenter GPU
    4.6 NVIDIA V100 for NVLINK and PCIe, Datacenter GPU
    4.7 Mellanox ConnectX-5 Dual Port 10/25GbE Adapter
5 Pre-deployment requirements and introduction to new features
    5.1 GPU hosts
    5.2 Client cluster
    5.3 Introduction to new features
        5.3.1 Remote clients
        5.3.2 Bare-metal server clients
        5.3.3 Improved health checks
    5.4 Bitfusion server and client software
    5.5 vCenter
    5.6 Client virtual machine
    5.7 Connectivity
    5.8 Network services
6 Solution overview
    6.1 Architecture
    6.2 Component information
    6.3 VLANs and IP subnet information
7 Deployment and configuration
    7.1 Verify the GPU host hardware configuration
    7.2 Add the GPU hosts and client cluster to the vCenter inventory
    7.3 Prepare and configure the GPU hosts and client cluster for PVRDMA
        7.3.1 Tag a VMkernel adapter for PVRDMA
        7.3.2 Add and manage hosts on the Bitfusion distributed switch
    7.4 Deploy the Open Virtual Appliance (OVA) to create the bitfusion-server-1 virtual machine
    7.5 Edit the bitfusion-server-1 hardware settings
    7.6 Deploy the Open Virtual Appliance (OVA) to create the bitfusion-server-2 virtual machine
    7.7 Edit the bitfusion-server-2 virtual machine hardware settings
    7.8 Deploy the OVA to create the bitfusion-server-3 virtual machine
    7.9 Edit the bitfusion-server-3 virtual machine hardware settings
    7.10 Provide client cluster access to GPU resources
    7.11 Support for remote clients and bare-metal server
    7.12 Support for backup and restore
8 Getting help
    8.1 Contacting Dell EMC
    8.2 Documentation resources
    8.3 VMware Hands-On-Labs (HOLs)
    8.4 Documentation feedback


Executive summary

Modern computational requirements have evolved and diverged with the wide acceptance and use of containers and virtualization. Today’s workloads challenge the CPU-centric paradigm that has long determined customers' server performance and utility needs. General-purpose CPUs no longer adequately handle these new workloads, whether they are artificial intelligence, machine learning, or virtual desktops. However, CPU power can be supplemented with add-in graphics processing units (GPUs). GPUs leverage thousands of computing cores instead of the tens of cores that general-purpose CPUs provide. Although putting dedicated GPU hardware into every server is uneconomical, VMware vSphere Bitfusion on PowerEdge servers delivers these pooled resources over the network to any properly configured client node. With Bitfusion built into the well-known vSphere infrastructure, administrators can optimize the use and increase the utilization of these expensive resources. This delivers immense value to networked nodes that require powerful GPUs to execute massively parallel workloads.


1 New and enhanced in VMware vSphere Bitfusion 2.5

The following are some of the important features included in the release of VMware vSphere Bitfusion 2.5:

Support for bare-metal server clients

Client support for virtual machines or servers running in an adjacent VMware vCenter cluster

Improved health checks

Introduction of new scripts that gather all server logs to send information to technical support

Upgraded backup and restore process for version 2.5

Support for earlier versions of VMware vSphere Bitfusion client software

The new features are described in the following sections of the deployment guide:

Remote clients

Bare-metal server clients

Improved health checks

Support for remote clients and bare-metal server

Support for backup and restore


2 Audience and scope

This deployment guide includes step-by-step instructions for deployment and configuration of the VMware vSphere Bitfusion appliance on Dell EMC PowerEdge R740xd and C4140 rack servers.

This deployment guide assumes that deployment personnel have prerequisite knowledge of the following areas and of the hardware they are using:

Use of Dell EMC servers and switches including the location of buttons, cables, and components in the hardware

Functional knowledge of the items in the Dell EMC owner's manuals for the products being used

Use of VMware products and the components or features of VMware vSphere

Data center infrastructure best practices in the areas of server, storage, networking, and environmental considerations such as power and cooling

Familiarity with CentOS installation, configuration, and package management

Familiarity with the NVIDIA CUDA Toolkit

The scope of this document excludes existing infrastructure components outside of the specific hardware and software that is mentioned in this guide. VMware vSphere Bitfusion support is not limited to the hardware models, configuration values, and software component versions used in this document. Dell EMC takes no responsibility for any issues that may be caused to existing infrastructure during deployment.


3 Overview

With the new VMware vSphere Bitfusion software, graphics processing units (GPUs) are no longer isolated from other resources. GPUs are now shared in a virtualized pool of resources, and you can access them through any virtual machine in the infrastructure, as shown in Figure 1. Similar to processor and storage resources, GPU deployments can now benefit from optimized utilization, reduced CapEx and OpEx, and accelerated development and deployment of R&D resources. Data scientists and AI developers can benefit from how Bitfusion supports monitoring higher workloads.

Bitfusion offers the following key features:

Dynamic GPU attach anywhere

Bitfusion disaggregates your GPU compute and dynamically attaches GPUs anywhere in the datacenter, just like attaching storage.

Fractional GPUs for efficiency

Bitfusion enables the use of arbitrary fractions of a GPU, so you can support more users in the test and development phase.

Standards-based accelerator access

Leverage GPUs across an infrastructure and integrate evolving technologies as standards emerge.

Application run-time virtualization

Bitfusion attaches GPUs based on CUDA calls at run time, maximizing utilization of GPU servers anywhere in the network.

Any application

Bitfusion is a transparent layer and runs with any workload in a TensorFlow or PyTorch ecosystem.
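The fractional GPU feature listed above can be pictured with a little arithmetic. The following Python sketch is illustrative only: it assumes that a fraction of a GPU corresponds to the same fraction of that GPU's memory, and it uses the 16 GB memory size quoted for the NVIDIA T4 in section 4.5. The function name partition_gpu_memory and the share values are hypothetical and are not part of the Bitfusion software.

```python
from fractions import Fraction

T4_MEMORY_GB = 16  # memory size quoted for the NVIDIA T4 in section 4.5


def partition_gpu_memory(total_gb, shares):
    """Return the memory in GB that each fractional share would receive.

    Assumption for this sketch: a share of a GPU maps to the same share
    of its memory. Shares must be positive and must not add up to more
    than one whole GPU.
    """
    if any(share <= 0 for share in shares):
        raise ValueError("every share must be positive")
    if sum(shares) > 1:
        raise ValueError("shares exceed one whole GPU")
    return [float(total_gb * share) for share in shares]


# Three hypothetical clients sharing one T4: one half and two quarters.
client_shares = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print(partition_gpu_memory(T4_MEMORY_GB, client_shares))  # [8.0, 4.0, 4.0]
```

In this example, three test and development users share a single card that would otherwise be dedicated to one of them.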


Figure 1: Bitfusion GPU sharing model

Deployment of VMware Bitfusion on Dell EMC PowerEdge servers provides an infrastructure solution that incorporates best-in-class hardware from Dell EMC with core VMware products. Virtualization of computation, storage, networking, and accelerators is delivered on a cluster of PowerEdge servers. The combination of VMware vSphere Bitfusion software on the Dell EMC PowerEdge hardware described in this document has been validated in Dell EMC labs.


4 Component overview

This section briefly describes the components that support VMware vSphere Bitfusion and their key capabilities to help you deploy the software.

4.1 Dell EMC PowerEdge R740xd server

The PowerEdge R740xd server provides scalable storage performance and data set processing. This 2U, 2-socket platform offers the scalability and performance to adapt to a variety of applications. The platform can be configured with up to 3x V100 GPUs or 6x NVIDIA T4 GPUs, and it also offers the flexibility to support additional configurations such as 24x 2.5” NVMe drives and two NVIDIA GPUs.

As you scale your deployments, scale your productivity with embedded intelligence and automation from iDRAC9 and the entire OpenManage portfolio, which is designed to simplify the IT lifecycle from deployment to retirement.

Key capabilities:

24 DIMM slots of DDR4 memory (RDIMM or LRDIMM)

Up to 24 SAS or SATA SSDs or hard drives, and NVMe PCIe SSDs

Boot device options such as BOSS

Double-wide GPUs, up to 300 W each, or single-wide GPUs, up to 150 W each
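The GPU slot limits above translate into a simple power budget. The Python sketch below multiplies the per-GPU limits from this list by the maximum GPU counts given for the R740xd, and adds the 70 W maximum quoted for the NVIDIA T4 in section 4.5. It is a rough planning aid under those stated figures only: it ignores CPUs, drives, and fans, and the function name gpu_power_budget is hypothetical.

```python
# Per-GPU slot limits from the key capabilities list (watts).
DOUBLE_WIDE_LIMIT_W = 300
SINGLE_WIDE_LIMIT_W = 150
T4_MAX_POWER_W = 70  # maximum power limit quoted for the T4 in section 4.5


def gpu_power_budget(gpu_count, watts_per_gpu):
    """Total GPU power in watts for a set of identical GPUs."""
    if gpu_count < 0 or watts_per_gpu < 0:
        raise ValueError("count and wattage must be non-negative")
    return gpu_count * watts_per_gpu


configs = {
    "3 x double-wide at slot limit": gpu_power_budget(3, DOUBLE_WIDE_LIMIT_W),
    "6 x single-wide at slot limit": gpu_power_budget(6, SINGLE_WIDE_LIMIT_W),
    "6 x NVIDIA T4 at 70 W": gpu_power_budget(6, T4_MAX_POWER_W),
}
for name, watts in configs.items():
    print(f"{name}: {watts} W")
```

Both fully populated slot configurations reach the same 900 W ceiling, while six T4 cards draw at most 420 W, which leaves headroom within the single-wide limit.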

Front view of a Dell EMC PowerEdge R740xd

Rear view of a Dell EMC PowerEdge R740xd

4.2 Dell EMC PowerEdge C4140 server

PowerEdge C4140 is an incredibly dense purpose-built rack server designed to handle the most demanding technical computing workloads. With the 2nd Generation Intel® Xeon® Scalable processors and NVIDIA® Volta® technologies, the C4140 fills a key gap as a leading GPU-accelerated platform in the PowerEdge server portfolio to enable a scalable business architecture in a heterogeneous data center environment. With four double-width accelerators in just 1U of space, the C4140 delivers outstanding performance and maximum density while reducing your space, cost and management requirements.


Key capabilities:

Unthrottled performance and superior thermal efficiency with patent-pending interleaved GPU system design*

No-compromise (CPU + GPU) acceleration technology up to 500 TFLOPS / U+ using the NVIDIA® Tesla™ V100 with NVLink™

2.4KW PSUs help future-proof for next generation GPUs

Front view of a Dell EMC PowerEdge C4140

Rear view of a Dell EMC PowerEdge C4140

Internal view of a Dell EMC PowerEdge R740xd displaying an NVIDIA T4 GPU card

4.3 Dell EMC vSAN Ready Node R740xd

Dell EMC vSAN Ready Nodes are pre-configured building blocks that reduce deployment risks with certified configurations, improve storage efficiency by up to 50%, and can help you build or scale your vSAN cluster faster. Whether you are just getting started or expanding your existing VMware environment, Dell EMC supports you every step of the way with consulting, education, deployment, and support services for the entire solution.

The Dell EMC vSAN Ready Node R740xd is a two-socket, 2U rack server designed to run complex workloads using highly scalable memory, I/O capacity, and network options. The vSAN Ready Node R740xd is available in All-Flash and Hybrid configurations and features the latest generation Intel® Xeon® Scalable processor family. Because it is a vSAN Ready Node, all approved configurations can be accessed in the VMware Compatibility Guide. The vSAN Ready Node R740xd adds extraordinary storage capacity options, making it well suited for data-intensive applications that require greater storage without sacrificing I/O performance. Because these are Ready Nodes, they take the guesswork and hassle out of procurement, deployment, and management.


4.4 Dell EMC Networking S5248F-ON Switch

The S5200F-ON series introduces optimized 25GbE and 100GbE open networking connectivity for servers and storage in demanding web and cloud environments. This next-generation family of top-of-rack 25GbE switches provides optimized performance both in-rack and between racks, a cost-effective 50/100GbE leaf/spine fabric, and migration capabilities for future connectivity needs.

Key capabilities:

48 x 10/25GbE SFP28 auto-negotiating ports

4 x 100GbE QSFP28 ports

2 x 2 x 100GbE QSFP28-DD ports
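The port counts above can be summed to estimate the aggregate front-panel bandwidth of the switch. The Python sketch below is a back-of-the-envelope calculation based only on the three port groups in this list. It assumes every port runs at its maximum listed speed and that each of the last two ports carries 2 x 100GbE. The names PORT_GROUPS and aggregate_gbps are hypothetical.

```python
# (description, number of ports, Gb/s per port) taken from the list above.
# Assumption: each double-density port carries 2 x 100GbE = 200 Gb/s.
PORT_GROUPS = [
    ("10/25GbE SFP28", 48, 25),
    ("100GbE QSFP28", 4, 100),
    ("2 x 100GbE double-density", 2, 200),
]


def aggregate_gbps(groups):
    """Sum the one-direction line rate of all ports, in Gb/s."""
    return sum(count * speed for _, count, speed in groups)


total = aggregate_gbps(PORT_GROUPS)
print(f"Aggregate line rate: {total} Gb/s")  # Aggregate line rate: 2000 Gb/s
```

Under these assumptions, the 48 server-facing ports account for 1200 Gb/s of the total and the uplink ports for the remaining 800 Gb/s.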

Front view of Dell PowerSwitch S5248F-ON

4.5 NVIDIA T4 Datacenter GPU

The NVIDIA® T4 is a single-slot, low-profile, 6.6-inch PCI Express Gen3 Universal Deep Learning Accelerator based on the TU104 NVIDIA graphics processing unit (GPU). The T4 has 16 GB GDDR6 memory and a 70 W maximum power limit. The T4 is offered as a passively cooled board that requires system air flow to operate the card within its thermal limits.

NVIDIA T4 GPU

4.6 NVIDIA V100 for NVLINK and PCIe, Datacenter GPU

The NVIDIA® V100 Tensor Core GPU, powered by the NVIDIA Volta architecture, is the most widely used accelerator for scientific computing and artificial intelligence. It is built to accelerate AI and data science workloads, comes in 16 GB and 32 GB configurations, and offers the performance of up to 32 CPUs in a single GPU. Deep learning training workloads can leverage the NVLink capability of the V100 SXM2 GPUs on the C4140 with NVLink. Using the V100 SXM2 GPU with NVLink enables direct communication between GPUs with bandwidth of up to 300 GB/s, further increasing the performance of AI training workloads.


NVIDIA V100 for PCIe

NVIDIA V100 for NVLink

4.7 Mellanox ConnectX-5 Dual Port 10/25GbE Adapter

ConnectX-5 EN supports two ports of 25Gb Ethernet connectivity, sub-600 ns latency, and very high message rate, plus PCIe switch and NVMe over Fabric offloads, providing the highest performance and most flexible solution for the most demanding applications and markets: Machine Learning, Data Analytics, and more.

Key capabilities:

Up to 25 Gb/s connectivity per port

Industry-leading throughput, low latency, low CPU utilization and high message rate

RoCE for Overlay Networks
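To put the 25 Gb/s per-port figure in context, it helps to compare it with the 300 GB/s GPU-to-GPU NVLink bandwidth quoted in section 4.6. The Python sketch below estimates the ideal time to move a payload over each link. It is a simplified model that uses only the peak rates stated in this guide and ignores latency, protocol overhead, and contention. The 16 GB payload size and the function name transfer_seconds are illustrative assumptions.

```python
NVLINK_GB_PER_S = 300.0        # peak NVLink bandwidth quoted in section 4.6
ETHERNET_GBIT_PER_S = 25.0     # peak rate of one ConnectX-5 port
ETHERNET_GB_PER_S = ETHERNET_GBIT_PER_S / 8  # 8 bits per byte -> 3.125 GB/s


def transfer_seconds(size_gb, rate_gb_per_s):
    """Ideal transfer time in seconds at a sustained peak rate."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb / rate_gb_per_s


payload_gb = 16  # hypothetical payload, equal to the memory of one T4
nvlink_s = transfer_seconds(payload_gb, NVLINK_GB_PER_S)
ethernet_s = transfer_seconds(payload_gb, ETHERNET_GB_PER_S)

print(f"NVLink:        {nvlink_s:.3f} s")
print(f"25GbE port:    {ethernet_s:.3f} s")
print(f"Ratio:         {ethernet_s / nvlink_s:.0f}x")
```

The gap of roughly two orders of magnitude is one reason the network adapter and its low-latency features matter when GPUs are accessed remotely rather than locally.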

Mellanox ConnectX-5 Dual Port 10/25GbE Adapter
