
Using InfiniBand for a scalable compute infrastructure
technology brief, 3rd edition
Contents
Abstract
Introduction
InfiniBand technology
    InfiniBand components
    InfiniBand software architecture
        MPI
        IPoIB
        RDMA-based protocols
        RDS
    InfiniBand hardware architecture
    Link operation
Scale-out clusters built on InfiniBand and HP technology
Conclusion
Appendix A: Glossary
For more information
Call to action
Abstract
With business models constantly changing to keep pace with today’s Internet-based, global economy, IT organizations are continually challenged to provide customers with high performance platforms while controlling their cost. An increasing number of enterprise businesses are implementing scale-out architectures as a cost-effective approach for scalable platforms, not just for high performance computing (HPC) but also for financial services and Oracle-based database applications.
InfiniBand (IB) is one of the most important technologies that enable the adoption of cluster computing. This technology brief describes InfiniBand as an interconnect technology used in cluster computing, provides basic technical information, and explains the advantages of implementing InfiniBand-based scale-out architectures.
Introduction
The overall performance of enterprise servers is determined by the synergistic relationship among three main subsystems: processing, memory, and input/output. The multiprocessor architecture used in the latest single-server systems (Figure 1) provides a high degree of parallel processing capability. However, multiprocessor server architecture cannot scale cost-effectively to a large number of processing cores. Scale-out cluster computing, which builds a larger system by connecting stand-alone systems with an interconnect technology, has become widely implemented in HPC and enterprise data centers around the world.
Figure 1. Architecture of a dual-processor single server (node), showing typical I/O bandwidths: Ultra-3 SCSI at 320 MBps, Fibre Channel at 400 MBps, Gigabit Ethernet at more than 1 Gbps, and InfiniBand at more than 10 Gbps (4x link, single data rate).
Figure 2 shows an example of cluster architecture that integrates computing, storage, and visualization functions into a single system. Applications are usually distributed to compute nodes through job scheduling tools.
Figure 2. Sample clustering architecture
Scale-out systems allow infrastructure architects to meet performance and cost goals, but interconnect performance, scalability, and reliability are key areas that must be carefully considered. A cluster infrastructure works best when built with an interconnect technology that scales easily, reliably, and economically with system expansion.
Ethernet is a pervasive, mature interconnect technology that can be cost-effective for some application workloads. The emergence of 10-Gigabit Ethernet (10GbE) offers a cluster interconnect that meets higher bandwidth requirements than 1GbE can provide. However, 10GbE still lags the latest InfiniBand technology in latency and bandwidth performance, and lacks native support for the fat-tree and mesh topologies used in scale-out clusters. InfiniBand remains the interconnect of choice for highly parallel environments where applications require low latency and high bandwidth across the entire fabric.
InfiniBand technology
InfiniBand is an industry-standard, channel-based architecture in which congestion management and zero-copy data transfers using remote direct memory access (RDMA) are core capabilities, resulting in high-speed, low-latency interconnects for scale-out compute infrastructures.
InfiniBand uses a multi-layer architecture to transfer data from one node to another. In the InfiniBand layer model (Figure 3), separate layers perform different tasks in the message passing process.
The upper layer protocols (ULPs) work closest to the operating system and application; they define the services and affect how much software overhead the data transfer will require. The InfiniBand transport layer is responsible for communication between applications. The transport layer splits the messages into data payloads and encapsulates each data payload and an identifier of the destination node into one or more packets. Packets can contain data payloads of up to four kilobytes.
The packets are passed to the network layer, which selects a route to the destination node and, if necessary, attaches the route information to the packets. The data link layer attaches a local identifier (LID) to the packet for communication at the subnet level. The physical layer transforms the packet into an electromagnetic signal based on the type of network media (copper or fiber).
Figure 3. Distributed computing using InfiniBand architecture
InfiniBand has these important characteristics:
- Very high bandwidth: up to 40 Gbps with Quad Data Rate (QDR) signaling
- Low-latency end-to-end communication: MPI ping-pong latency approaching 1 microsecond
- Hardware-based protocol handling, resulting in faster throughput and low CPU overhead due to efficient OS bypass and RDMA
- Native support for fat-tree and other common mesh topologies in fabric design
InfiniBand components
InfiniBand architecture involves four key components:
- Host channel adapter
- Subnet manager
- Target channel adapter
- InfiniBand switch
A host node or server requires a host channel adapter (HCA) to connect to an InfiniBand infrastructure. An HCA can be a card installed in an expansion slot or integrated onto the host’s system board. An HCA can communicate directly with another HCA, with a target channel adapter, or with an InfiniBand switch.
InfiniBand uses subnet manager (SM) software to manage the InfiniBand fabric and to monitor interconnect performance and health at the fabric level. A fabric can be as simple as a point-to-point connection or multiple connections through one or more switches. The SM software resides on a node or switch within the fabric, and provides switching and configuration information to all of the switches in the fabric. Additional backup SMs may be located within the fabric for failover should the primary SM fail. All other nodes in the fabric will contain an SM agent that processes management data. Managers and agents communicate using management datagrams (MADs).
A target channel adapter (TCA) is used to connect an external storage unit or I/O interface to an InfiniBand infrastructure. The TCA includes an I/O controller specific to the device’s protocol (SCSI, Fibre Channel, Ethernet, etc.) and can communicate with an HCA or an InfiniBand switch.
An InfiniBand switch provides scalability by allowing a number of HCAs, TCAs, and other IB switches to connect to an InfiniBand infrastructure. The switch handles network traffic by checking the local link header of each data packet received and forwarding the packet to the proper destination.
The most basic InfiniBand infrastructure will consist of host nodes or servers equipped with HCAs, an InfiniBand switch, and subnet manager software. More expansive networks will include multiple switches.
InfiniBand software architecture
InfiniBand, like Ethernet, uses a multi-layer processing stack to transfer data between nodes. The InfiniBand architecture, however, offloads communication processing from the operating system and provides RDMA operations as core capabilities, and it offers greater adaptability through a variety of services and protocols.
While the majority of existing InfiniBand clusters operate on the Linux platform, drivers and HCA stacks are also available for Microsoft® Windows®, HP-UX, Solaris, and other operating systems from various InfiniBand hardware and software vendors.
The layered software architecture of the HCA allows developers to write code without a specific hardware implementation in mind. The functionality of an HCA is defined by its verb set, a table of commands used by the application programming interface (API) of the operating system being run. A number of services and software protocols are available (Figure 4) and, depending on type, can be implemented from user space or from the kernel.
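As an illustration of this hardware-independent programming model, the following is a minimal sketch (not taken from this brief) of how an application might enumerate HCAs and query their attributes through the OpenFabrics verbs API; error handling is kept to a minimum.

/*
 * Minimal sketch: enumerate InfiniBand HCAs and query their attributes
 * through the OpenFabrics verbs API (libibverbs). Compile with -libverbs.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num_devices; ++i) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (!ibv_query_device(ctx, &attr))
            printf("%s: %d port(s), max QPs %d, max MR size %llu\n",
                   ibv_get_device_name(devs[i]), attr.phys_port_cnt,
                   attr.max_qp, (unsigned long long)attr.max_mr_size);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}

The same code runs unchanged on any vendor's HCA, because the hardware-specific details are hidden behind the verbs interface and the hardware-specific driver shown in Figure 4.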
Figure 4. InfiniBand software layers. In user space, applications reach the fabric through MPIs, uDAPL, the SDP library, the MAD API, and the OpenFabrics verbs, CMA, and API layer. In kernel space, mid-layer modules (the connection manager, SA client, SMA, MAD services, Open SM, and diagnostic tools) support upper-level protocols such as IPoIB, RDS, VNIC, and the RDMA-based protocols (kDAPL, SDP, SRP, iSER, NFS), which provide IP-based, sockets-based, block storage, file system, and clustered database access. All of these layers run over the OpenFabrics verbs and API, a hardware-specific driver, and the InfiniBand HCA.
As indicated in Figure 4, InfiniBand supports a variety of upper level protocols (ULPs) and libraries that have evolved since the introduction of InfiniBand. The key protocols are discussed in the following sections.
MPI
The message passing interface (MPI) protocol is a library of calls used by applications in a parallel computing environment to communicate between nodes. MPI calls are optimized for performance in a compute cluster that takes advantage of high-bandwidth and low-latency interconnects. In parallel computing environments, code is executed across multiple nodes simultaneously. MPI facilitates the communication and synchronization among these jobs across the entire cluster.
To take advantage of the features of MPI, an application must be written and compiled to include the libraries from the particular MPI implementation used. Several implementations of MPI are on the market:
- HP-MPI
- Intel MPI
- Publicly available versions such as MVAPICH2 and Open MPI
MPI has become the de-facto IB ULP standard. In particular, HP-MPI has been accepted by more independent software vendors (ISVs) than any other commercial MPI. Because HP-MPI uses shared libraries, applications built on it can transparently select among interconnects, which significantly reduces the effort required to support the popular interconnect technologies. HP-MPI is supported on HP-UX, Linux, Tru64 UNIX, and Microsoft Windows Compute Cluster Server 2003.
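To make the ping-pong latency figure quoted earlier concrete, the following is a minimal sketch of such a measurement; it assumes any standards-conforming MPI implementation (HP-MPI, MVAPICH2, Open MPI, and so on) and is compiled with that implementation's mpicc wrapper.

/* Minimal MPI ping-pong latency sketch: rank 0 and rank 1 exchange a small
 * message repeatedly and report the average one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char buf[8] = {0};                      /* small payload stresses latency */
    MPI_Status status;

    double start = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("average one-way latency: %.2f us\n",
               elapsed / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}

Run with two ranks placed on different nodes (for example, mpirun -np 2) so the exchange actually crosses the interconnect rather than shared memory.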
IPoIB
Internet Protocol over InfiniBand (IPoIB) allows the use of TCP/IP- or UDP/IP-based applications between nodes connected to an InfiniBand fabric. IPoIB supports IPv4 and IPv6 protocols and addressing schemes. An InfiniBand HCA is configured through the operating system as a traditional network adapter, so standard IP-based applications such as ping, FTP, and Telnet work unchanged. IPoIB does not support the RDMA features of InfiniBand. Communication between IB nodes using IPoIB and Ethernet nodes using IP requires a gateway/router interface.
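As a simple illustration of this point, the following is ordinary TCP sockets code with nothing InfiniBand-specific in it; the address and port are hypothetical values assumed to be assigned to the peer's IPoIB interface (for example, ib0).

/* Ordinary TCP client code: when the destination address belongs to an IPoIB
 * interface (e.g., ib0), the same unmodified code runs over the InfiniBand
 * fabric. The address and port below are hypothetical examples. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                         /* hypothetical port    */
    inet_pton(AF_INET, "192.168.100.12", &peer.sin_addr);  /* hypothetical ib0 IP  */

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over IPoIB\n";
    send(fd, msg, sizeof(msg) - 1, 0);
    close(fd);
    return 0;
}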
RDMA-based protocols
DAPL – The Direct Access Programming Library (DAPL) allows low-latency RDMA communications between nodes. uDAPL provides user-level access to RDMA functionality on InfiniBand, while kDAPL provides the kernel-level API. To use RDMA for data transfers between nodes, applications must be written against a specific DAPL implementation.
SDP – Sockets Direct Protocol (SDP) is an RDMA protocol that operates from the kernel. Applications must be written to take advantage of the SDP interface. SDP is based on the WinSock Direct Protocol used by Microsoft server operating systems and is suited for connecting databases to application servers.
SRP – SCSI RDMA Protocol (SRP) is a data movement protocol that encapsulates SCSI commands over InfiniBand for SAN networking. Operating from the kernel level, SRP allows copying SCSI commands between systems using RDMA for low-latency communications with storage systems.
iSER – iSCSI Enhanced RDMA (iSER) is a storage standard originally specified on the iWARP RDMA technology and now officially supported on InfiniBand. The iSER protocol provides iSCSI manageability to RDMA storage operations.
NFS – The Network File System (NFS) is a storage protocol that has evolved since its inception in the 1980s, undergoing several generations of development while remaining network-independent. With the development of high-performance I/O such as PCIe and the significant advances in memory subsystems, NFS over RDMA on InfiniBand offers low-latency performance for transparent file sharing across different platforms.
RDS
Reliable Datagram Sockets (RDS) is a low-overhead, low-latency, high-bandwidth transport protocol that yields high performance for cluster interconnect-intensive environments such as Oracle Database or Real Application Clusters (RAC). The RDS protocol offloads error-checking operations to the InfiniBand fabric, freeing more CPU time for application processing.
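For illustration only, the following is a hedged sketch of the RDS socket interface as typically exposed on Linux (an AF_RDS, SOCK_SEQPACKET socket bound to a local IPoIB address); the addresses and port are hypothetical, and the rds kernel modules are assumed to be loaded.

/* Minimal RDS sketch: create a Reliable Datagram Socket, bind it to a local
 * IPoIB address, and send one datagram to a peer. Addresses and the port
 * are hypothetical; the Linux rds module is assumed to be loaded. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_RDS
#define AF_RDS 21   /* fallback for older libc headers */
#endif

int main(void)
{
    int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
    if (fd < 0) { perror("socket(AF_RDS)"); return 1; }

    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_port   = htons(18634) };
    inet_pton(AF_INET, "192.168.100.11", &local.sin_addr);   /* local ib0 IP  */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind"); close(fd); return 1;
    }

    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(18634) };
    inet_pton(AF_INET, "192.168.100.12", &peer.sin_addr);    /* remote ib0 IP */

    const char msg[] = "rds datagram";
    if (sendto(fd, msg, sizeof(msg), 0,
               (struct sockaddr *)&peer, sizeof(peer)) < 0)
        perror("sendto");

    close(fd);
    return 0;
}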
InfiniBand hardware architecture
InfiniBand fabrics use high-speed, bi-directional serial interconnects between devices. Interconnect bandwidth and distances are characteristics determined by the type of cabling and connections employed. The bi-directional links contain dedicated send and receive lanes for full duplex operation.
For quad data rate (QDR) operation, each lane has a signaling rate of 10 Gbps. Bandwidth is increased by adding more lanes per link. InfiniBand interconnect types include 1x, 4x, and 12x wide full-duplex links (Figure 5). The 4x link is the most popular configuration and provides a theoretical full-duplex QDR bandwidth of 80 (2 x 40) gigabits per second.
Figure 5. InfiniBand link types. A 1x link consists of one send channel and one receive channel, each using differential voltage or fiber optic signaling; 4x and 12x links aggregate 4 and 12 such lane pairs, respectively.
Encoding overhead in the data transmission process limits the maximum data bandwidth per link to approximately 80 percent of the signal rate. However, the switched fabric design of InfiniBand allows bandwidth to grow or aggregate as links and nodes are added. Double data rate (DDR) and especially quad data rate (QDR) operation increase bandwidth significantly (Table 1).
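As a quick worked example (assuming the 8b/10b encoding used on SDR, DDR, and QDR links, which is consistent with the roughly 80 percent figure above), the per-direction data rate of a 4x link works out to:

\[
\text{data rate} \approx \text{signal rate} \times \tfrac{8}{10}
\]
\[
\text{4x QDR: } 40~\text{Gbps} \times 0.8 = 32~\text{Gbps per direction} \qquad
\text{4x DDR: } 20~\text{Gbps} \times 0.8 = 16~\text{Gbps per direction}
\]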
Table 1. InfiniBand interconnect bandwidth

Link    SDR signal rate    DDR signal rate    QDR signal rate
1x      2.5 Gbps           5 Gbps             10 Gbps
4x      10 Gbps            20 Gbps            40 Gbps
12x     30 Gbps            60 Gbps            120 Gbps
The operating distances of InfiniBand interconnects are contingent upon the type of cabling (copper or fiber optic), the connector, and the signal rate. The common connectors used in InfiniBand interconnects today are CX4 and quad small-form-factor pluggable (QSFP) as shown in Figure 6.
Figure 6. InfiniBand connectors
Fiber optic cable with CX4 connectors generally offers the greatest distance capability. The adoption of 4X DDR products is widespread, and deployment of QDR systems is expected to increase.
Link operation
Each link can be divided (multiplexed) into a set of virtual lanes, similar to highway lanes (Figure 7). Each virtual lane provides flow control and allows a pair of devices to communicate autonomously. Typical implementations have each link accommodating eight virtual lanes (the IBTA specification defines a minimum of two and a maximum of 16 virtual lanes per link); one lane is reserved for fabric management and the others for packet transport. The virtual lane design allows an InfiniBand link to share bandwidth between various sources and targets simultaneously. For example, if a 10-Gbps link were divided into five virtual lanes, each lane would have a bandwidth of 2 Gbps. The InfiniBand architecture defines a virtual lane mapping algorithm to ensure interoperability between end nodes that support different numbers of virtual lanes.
Figure 7. InfiniBand virtual lane operation
When a connection between two channel adapters is established, one of the following transport layer communication protocols is selected:
- Reliable connection (RC) – data transfer between two entities using receive acknowledgment
- Unreliable connection (UC) – same as RC but without acknowledgement (rarely used)
- Reliable datagram (RD) – data transfer using an RD channel between RD domains
- Unreliable datagram (UD) – data transfer without acknowledgement
- Raw packets (RP) – transfer of datagram messages that are not interpreted
These protocols can be implemented in hardware, with some being more efficient than others. The UD and raw packet protocols, for instance, are basic datagram movers and may require system processor support depending on the ULP used.
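To show how these transport services surface in software, the following is a hedged sketch of requesting a reliable connection (RC), unreliable connection (UC), or unreliable datagram (UD) queue pair through the OpenFabrics verbs API; the protection domain and completion queue are assumed to have been created already, and the reliable datagram and raw packet services are not created through this particular call.

/* Sketch: create a queue pair of a given transport type (IBV_QPT_RC,
 * IBV_QPT_UC, or IBV_QPT_UD) with the OpenFabrics verbs API. The protection
 * domain and completion queue are assumed to exist already. */
#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                         enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.qp_type          = type;   /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
    attr.send_cq          = cq;     /* completions for send work requests    */
    attr.recv_cq          = cq;     /* completions for receive work requests */
    attr.cap.max_send_wr  = 64;     /* outstanding send work requests        */
    attr.cap.max_recv_wr  = 64;     /* outstanding receive work requests     */
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    return ibv_create_qp(pd, &attr); /* NULL on failure */
}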
When the reliable connection protocol is operating (Figure 8), hardware at the source generates packet sequence numbers for every packet sent, and the hardware at the destination checks the sequence numbers and generates acknowledgments for every packet sequence number received. The hardware also detects missing packets, rejects duplicate packets, and provides recovery services for failures in the fabric.
Figure 8. Link operation using reliable connection protocol
The programming model for the InfiniBand transport assumes that an application accesses at least one Send and one Receive queue to initiate the I/O. The transport layer supports four types of data transfers for the Send queue:
- Send/Receive – typical operation in which one node sends a message and another node receives the message
- RDMA Write – operation in which one node writes data directly into a memory buffer of a remote node
- RDMA Read – operation in which one node reads data directly from a memory buffer of a remote node
- RDMA Atomics – allows atomic update of a memory location from an HCA perspective
The only operation available for the receive queue is Post Receive Buffer transfer, which identifies a buffer that a client may send to or receive from using a Send or RDMA Write data transfer.
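For illustration, the following is a minimal sketch of posting an RDMA Write work request with the OpenFabrics verbs API; the queue pair, registered memory region, and the remote buffer address and key (exchanged out of band) are assumed to exist already.

/* Sketch: post an RDMA Write work request that copies a locally registered
 * buffer directly into a remote node's memory. The queue pair, memory region,
 * and the remote address/rkey (exchanged out of band) are assumed to exist. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* registered local buffer       */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,              /* local key from ibv_reg_mr()   */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;                  /* returned in completion */
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* RDMA Write operation   */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion  */
    wr.wr.rdma.remote_addr = remote_addr;        /* target buffer address  */
    wr.wr.rdma.rkey        = rkey;               /* remote key for buffer  */

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success           */
}

Because the HCA performs the transfer, the remote node's CPU is not involved in moving the data; it learns of the update only through whatever application-level signaling follows.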
Scale-out clusters built on InfiniBand and HP technology
In the past few years, scale-out cluster computing has become a mainstream architecture for high performance computing. As the technology becomes more mature and affordable, scale-out clusters are being adopted in a broader market beyond HPC; the HP Oracle Database Machine is one example. The trend in this industry is toward using space- and power-efficient blade systems as building blocks for scale-out solutions. HP BladeSystem c-Class solutions offer significant savings in power, cooling, and data center floor space without compromising performance.
The c7000 enclosure supports up to 16 half-height or 8 full-height server blades and includes rear mounting bays for management and interconnect components. Each server blade includes mezzanine connectors for I/O options such as the HP 4x QDR IB mezzanine card. HP c-Class server blades are available in two form-factors and server node configurations to meet various density goals. To meet extreme density goals, the half-height HP BL2x220c server blade includes two server nodes. Each node can support two quad-core Intel Xeon 5400-series processors and a slot for a mezzanine board, providing a maximum of 32 nodes and 256 cores per c7000 enclosure.
NOTE:
The DDR HCA mezzanine card should be installed in a PCIe x8 connector for maximum InfiniBand performance. The QDR HCA mezzanine card is supported on ProLiant G6 blades with PCIe x8 Gen 2 mezzanine connectors.
Figure 9 shows a full-bandwidth fat-tree configuration of HP BladeSystem c-Class components providing 576 nodes in a cluster. Each c7000 enclosure includes an HP 4x QDR IB Switch, which provides 16 downlinks for server blade connections and 16 QSFP uplinks for fabric connectivity. Spine-level fabric connectivity is provided through sixteen 36-port Voltaire 4036 QDR InfiniBand Switches (qualified, marketed, and supported by HP). The Voltaire 36-port switches provide 40-Gbps per-port performance and offer fabric management capabilities.
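The node and core counts shown in Figure 9 follow directly from these building blocks; as a quick check of the arithmetic:

\[
36~\text{enclosures} \times 16~\text{blades per enclosure} = 576~\text{nodes}
\]
\[
576~\text{nodes} \times 2~\text{processors per node} \times 4~\text{cores per processor} = 4608~\text{cores}
\]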
Figure 9. HP BladeSystem c-Class 576-node cluster configuration: 36 HP c7000 enclosures, each holding 16 HP BL280c G6 server blades with 4x QDR HCAs and an HP 4x QDR IB interconnect switch, connected through 16 uplinks per enclosure to sixteen 36-port QDR IB spine switches.

Total nodes: 576 (1 per blade)
Total processor cores: 4608 (2 Nehalem processors per node, 4 cores per processor)
Memory: 28 TB with 4 GB DIMMs (48 GB per node) or 55 TB with 8 GB DIMMs (96 GB per node)
Storage: 2 NHP SATA or SAS drives per node
Interconnect: 1:1 full bandwidth (non-blocking), 3 switch hops maximum, fabric redundancy
The HP Unified Cluster Portfolio includes a range of hardware, software, and services that provide customers a choice of pre-tested, pre-configured systems for simplified implementation, fast deployment, and standardized support.
HP solutions optimized for HPC:
- HP Cluster Platforms – flexible, factory-integrated and tested systems built around specific platforms, backed by HP warranty and support, and built to uniform, worldwide specifications
- HP Scalable File Share (HP SFS) – high-bandwidth, scalable HP storage appliance for Linux clusters
- HP Financial Services Industry (FSI) solutions – defined solution stacks and configurations for real-time market data systems
HP and partner solutions optimized for scale-out database applications:
- HP Oracle Exadata Storage
- HP Oracle Database Machine
- HP BladeSystem for Oracle Optimized Warehouse (OOW)
HP Cluster Platforms are built around specific hardware and software platforms and offer a choice of interconnects. For example, the HP Cluster Platform CL3000BL uses HP BL2x220c G5, BL280c G6, and BL460c blade servers as compute nodes with a choice of GbE or InfiniBand interconnects. No longer unique to Linux or HP-UX environments, HPC clustering is now supported on Microsoft Windows Compute Cluster Server 2003, with native support for HP-MPI.
Conclusion
InfiniBand offers an industry-standard, high-bandwidth, low-latency, scalable interconnect with a high degree of connectivity between servers connected to a fabric. While zero-copy (RDMA) protocols have been applied to TCP/IP networks such as Ethernet, RDMA is a core capability of the InfiniBand architecture. Flow control support is native to the HCA design, and the latency of InfiniBand data transfers (approaching 1 microsecond for MPI ping-pong latency) is generally less than that of 10Gb Ethernet.
InfiniBand provides native support for fat-tree and other mesh topologies, allowing simultaneous connections across multiple links. This gives the fabric the ability to scale or aggregate bandwidth as more nodes and/or additional links are connected.
InfiniBand is further strengthened by HP-MPI becoming the leading solution among ISVs for developing and running MPI-based applications across multiple platforms and interconnect types. Software development and support are simplified because interconnects from a variety of vendors can be supported by an application written to HP-MPI.
Parallel compute applications that involve a high degree of message passing between nodes benefit significantly from InfiniBand. HP BladeSystem c-Class clusters and similar rack-mounted clusters support IB QDR and DDR HCAs and switches.
InfiniBand offers solid growth potential in performance, with DDR infrastructure currently accepted as mainstream, QDR becoming available, and Eight Data Rate (EDR) with a per-port rate of 80 Gbps being discussed by the InfiniBand Trade Association (IBTA) as the next target level.
The decision to use Ethernet or InfiniBand should be based on interconnect performance and cost requirements. HP is committed to support both InfiniBand and Ethernet infrastructures, and to help customers choose the most cost-effective fabric interconnect solution.
Appendix A: Glossary
The following table lists InfiniBand-related acronyms and abbreviations used in this document.
Table 2. Acronyms and abbreviations

API – Application Programming Interface: software routine or object written in support of a language
DDR – Double Data Rate: for InfiniBand, a clock rate of 5.0 Gbps (2.5 Gbps x 2)
GbE – Gigabit Ethernet: Ethernet network operating at 1 Gbps or greater
HCA – Host Channel Adapter: hardware interface connecting a server node to the IB network
HPC – High Performance Computing: the use of computer and/or storage clusters
IB – InfiniBand: interconnect technology for distributed computing/storage infrastructure
IP – Internet Protocol: standard of data communication over a packet-switched network
IPoIB – Internet Protocol over InfiniBand: protocol allowing the use of TCP/IP over IB networks
iSER – iSCSI Enhanced RDMA: file storage protocol
iWARP – Internet Wide Area RDMA Protocol: protocol enhancement allowing RDMA over TCP
MPI – Message Passing Interface: library of calls used by applications in parallel compute systems
NFS – Network File System: file storage protocol
NHP – Non Hot Pluggable (drive)
QDR – Quad Data Rate: for InfiniBand, a clock rate of 10 Gbps (2.5 Gbps x 4)
QSFP – Quad Small Form-factor Pluggable: interconnect connector type
RDMA – Remote Direct Memory Access: protocol allowing data movement in and out of system memory without CPU intervention
RDS – Reliable Datagram Sockets: transport protocol
SATA – Serial ATA (hard drive)
SAS – Serial Attached SCSI (hard drive)
SDP – Sockets Direct Protocol: kernel-level RDMA-based protocol
SDR – Single Data Rate: for InfiniBand, standard clock rate of 2.5 Gbps
SM – Subnet Manager: management software for an InfiniBand network
SRP – SCSI RDMA Protocol: data movement protocol
TCA – Target Channel Adapter: hardware interface connecting a storage or I/O node to the IB network
TCP – Transmission Control Protocol: core high-level protocol for packet-based data communication
TOE – TCP Offload Engine: accessory processor and/or driver that assumes TCP/IP duties from the system CPU
ULP – Upper Level Protocol: protocol layer that defines the method of data transfer over InfiniBand
VNIC – Virtual Network Interface Controller: software interface that allows a host on an InfiniBand fabric to access nodes on an external Ethernet network
For more information
For additional information, refer to the resources listed below.

HP products: www.hp.com
HPC/IB/cluster products: www.hp.com/go/hptc
HP InfiniBand products: http://h18004.www1.hp.com/products/servers/networking/index-ib.html
InfiniBand Trade Association: http://www.infinibandta.org
OpenFabrics Alliance: http://www.openib.org/
RDMA Consortium: http://www.rdmaconsortium.org
Technology brief discussing iWARP RDMA: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00589475/c00589475.pdf
HP BladeSystem: http://h18004.www1.hp.com/products/blades/components/c-class-tech-function.html

Call to action
Send comments about this paper to: TechCom@HP.com.

© 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
Microsoft, Windows, and Windows NT are US registered trademarks of Microsoft Corporation.
Linux is a U.S. registered trademark of Linus Torvalds.
TC090403TB, April 2009