QLogic OFED+ Host Software Release 1.5.4 User Guide

OFED+ Host Software
Release 1.5.4
User Guide
IB0054606-02 A
Information furnished in this manual is believed to be accurate and reliable. However, QLogic Corporation assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only. QLogic Corporation makes no representation nor warranty that such applications are suitable for the specified use without further testing or modification. QLogic Corporation assumes no responsibility for any errors that may appear in this document.
Document Revision History
Revision A, April 2012
Changes Sections Affected

Table of Contents

Preface
Intended Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Related Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Documentation Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
License Agreements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Technical Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Contact Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Knowledge Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
1 Introduction
How this Guide is Organized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
2 Step-by-Step Cluster Setup and MPI Usage Checklists
Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
Using MPI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
3 InfiniBand® Cluster Setup and Administration
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Installed Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
IB and OpenFabrics Driver Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
IPoIB Network Interface Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
IPoIB Administration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Administering IPoIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Stopping, Starting and Restarting the IPoIB Driver. . . . . . . . . . 3-5
Configuring IPoIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Editing the IPoIB Configuration File . . . . . . . . . . . . . . . . . . . . . 3-5
IB Bonding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Interface Configuration Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Red Hat EL5 and EL6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
SuSE Linux Enterprise Server (SLES) 10 and 11. . . . . . . . . . . . 3-8
Verify IB Bonding is Configured. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Subnet Manager Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
QLogic Distributed Subnet Administration . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Applications that use Distributed SA . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Virtual Fabrics and the Distributed SA. . . . . . . . . . . . . . . . . . . . . . . . . 3-13
Configuring the Distributed SA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
Default Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
Multiple Virtual Fabrics Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Virtual Fabrics with Overlapping Definitions . . . . . . . . . . . . . . . . . . . . 3-15
Distributed SA Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
SID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
ScanFrequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
LogFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
Dbg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Other Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Changing the MTU Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
Managing the ib_qib Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Configure the ib_qib Driver State. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Start, Stop, or Restart ib_qib Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Unload the Driver/Modules Manually. . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
ib_qib Driver Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
More Information on Configuring and Loading Drivers. . . . . . . . . . . . . . . . . 3-24
Performance Settings and Management Tips . . . . . . . . . . . . . . . . . . . . . . . 3-24
Performance Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
Systems in General (With Either Intel or AMD CPUs) . . . . . . . . 3-25
AMD CPU Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
AMD Interlagos CPU Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Intel CPU Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
High Risk Tuning for Intel Harpertown CPUs . . . . . . . . . . . . . . . 3-30
Additional Driver Module Parameter Tunings Available . . . . . . . 3-31
Performance Tuning using ipath_perf_tuning Tool . . . . . . . . . . . 3-34
OPTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
AUTOMATIC vs. INTERACTIVE MODE. . . . . . . . . . . . . . . . . . . 3-36
Affected Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
Homogeneous Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
Adapter and Other Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-38
Remove Unneeded Services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39
Host Environment Setup for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
Configuring for ssh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
Configuring ssh and sshd Using shosts.equiv . . . . . . . . . . 3-40
Configuring for ssh Using ssh-agent . . . . . . . . . . . . . . . . . . . 3-43
Process Limitation with ssh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-44
Checking Cluster and Software Status. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-44
ipath_control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-44
iba_opp_query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
ibstatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-46
ibv_devinfo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
ipath_checkout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
4 Running MPI on QLogic Adapters
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
MPIs Packaged with QLogic OFED+ . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Open MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Compiling Open MPI Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Create the mpihosts File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Running Open MPI Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Further Information on Open MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Configuring MPI Programs for Open MPI . . . . . . . . . . . . . . . . . . . . . . 4-5
To Use Another Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Compiler and Linker Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
Process Allocation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
IB Hardware Contexts on the QDR IB Adapters. . . . . . . . . . . . . 4-8
Enabling and Disabling Software Context Sharing. . . . . . . . . . . 4-9
Restricting IB Hardware Contexts in a Batch Environment . . . . 4-10
Context Sharing Error Messages . . . . . . . . . . . . . . . . . . . . . . . . 4-11
Running in Shared Memory Mode . . . . . . . . . . . . . . . . . . . . . . . 4-11
mpihosts File Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-12
Using Open MPI’s mpirun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
Console I/O in Open MPI Programs . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
Environment for Node Programs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Remote Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Exported Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Setting MCA Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
Environment Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
Job Blocking in Case of Temporary IB Link Failures . . . . . . . . . . . . . . 4-20
Open MPI and Hybrid MPI/OpenMP Applications . . . . . . . . . . . . . . . . . . . . 4-21
Debugging MPI Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
MPI Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Using Debuggers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
5 Using Other MPIs
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
Installed Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Open MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
MVAPICH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Compiling MVAPICH Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Running MVAPICH Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Further Information on MVAPICH . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
MVAPICH2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Compiling MVAPICH2 Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Running MVAPICH2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Further Information on MVAPICH2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Managing MVAPICH and MVAPICH2 with the mpi-selector Utility . . . . . . . . . . . . . . 5-5
Platform MPI 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Compiling Platform MPI 8 Applications . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Running Platform MPI 8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
More Information on Platform MPI 8 . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Intel MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Compiling Intel MPI Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
Running Intel MPI Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
Further Information on Intel MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Improving Performance of Other MPIs Over IB Verbs . . . . . . . . . . . . . . . . . 5-12
6 SHMEM Description and Configuration
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
SHMEM Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
Basic SHMEM Program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
Compiling SHMEM Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Running SHMEM Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Using shmemrun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Running programs without using shmemrun . . . . . . . . . . . . . . . 6-6
QLogic SHMEM Relationship with MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Slurm Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Full Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Two-step Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
No Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Sizing Global Shared Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Progress Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
Active Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Passive Progress. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Active versus Passive Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Implementation Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Application Programming Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-17
SHMEM Benchmark Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-27
7 Virtual Fabric support in PSM
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
Virtual Fabric Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Using SL and PKeys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Using Service ID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
SL2VL mapping from the Fabric Manager . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Verifying SL2VL tables on QLogic 7300 Series Adapters . . . . . . . . . . . . . . 7-4
8 Dispersive Routing
9 gPXE
gPXE Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
Required Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Preparing the DHCP Server in Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Installing DHCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Configuring DHCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Netbooting Over IB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
Boot Server Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
Steps on the gPXE Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
HTTP Boot Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
A Benchmark Programs
Benchmark 1: Measuring MPI Latency Between Two Nodes . . . . . . . . . . . A-1
Benchmark 2: Measuring MPI Bandwidth Between Two Nodes . . . . . . . . . A-4
Benchmark 3: Messaging Rate Microbenchmarks. . . . . . . . . . . . . . . . . . . . A-6
OSU Multiple Bandwidth / Message Rate test (osu_mbw_mr) . . . . . . . . . . . . . . A-6
An Enhanced Multiple Bandwidth / Message Rate test (mpi_multibw) . . . . . . . . A-7
B SRP Configuration
SRP Configuration Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
Important Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
QLogic SRP Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
Stopping, Starting and Restarting the SRP Driver . . . . . . . . . . . . . . . . B-3
Specifying a Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
Determining the values to use for the configuration . . . . . . . . . . B-6
Specifying an SRP Initiator Port of a Session by Card and Port Indexes . . . . . . B-8
Specifying an SRP Initiator Port of Session by Port GUID . . . . . B-8
Specifying a SRP Target Port . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
Specifying a SRP Target Port of a Session by IOCGUID . . . . . . B-10
Specifying a SRP Target Port of a Session by Profile String . . . B-10
Specifying an Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
Restarting the SRP Module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
Configuring an Adapter with Multiple Sessions . . . . . . . . . . . . . . . . . . B-11
Configuring Fibre Channel Failover. . . . . . . . . . . . . . . . . . . . . . . . . . . B-13
Failover Configuration File 1: Failing over from one SRP Initiator port to another . . . . . . B-14
Failover Configuration File 2: Failing over from a port on the VIO hardware card to another port on the VIO hardware card . . . . . . B-15
Failover Configuration File 3: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card within the same Virtual I/O chassis . . . . . . B-16
Failover Configuration File 4: Failing over from a port on a VIO hardware card to a port on a different VIO hardware card in a different Virtual I/O chassis . . . . . . B-17
Configuring Fibre Channel Load Balancing. . . . . . . . . . . . . . . . . . . . . B-18
1 Adapter Port and 2 Ports on a Single VIO. . . . . . . . . . . . . . . . B-18
2 Adapter Ports and 2 Ports on a Single VIO Module . . . . . . . . B-19
Using the roundrobinmode Parameter . . . . . . . . . . . . . . . . . . . . B-20
Configuring SRP for Native IB Storage . . . . . . . . . . . . . . . . . . . . . . . . B-21
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-23
Additional Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-24
Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-24
OFED SRP Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-24
C Integration with a Batch Queuing System
Clean Termination of MPI Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Clean-up PSM Shared Memory Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
D Troubleshooting
Using LEDs to Check the State of the Adapter . . . . . . . . . . . . . . . . . . . . . . D-1
BIOS Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2
Kernel and Initialization Issues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2
Driver Load Fails Due to Unsupported Kernel. . . . . . . . . . . . . . . . . . . D-3
Rebuild or Reinstall Drivers if Different Kernel Installed . . . . . . . . . . . D-3
InfiniPath Interrupts Not Working. . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
OpenFabrics Load Errors if ib_qib Driver Load Fails . . . . . . . . . . . . D-4
InfiniPath ib_qib Initialization Failure. . . . . . . . . . . . . . . . . . . . . . . . D-5
MPI Job Failures Due to Initialization Problems . . . . . . . . . . . . . . . . . D-6
OpenFabrics and InfiniPath Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-6
Stop Infinipath Services Before Stopping/Restarting InfiniPath. . . . . . D-6
Manual Shutdown or Restart May Hang if NFS in Use . . . . . . . . . . . . D-7
Load and Configure IPoIB Before Loading SDP . . . . . . . . . . . . . . . . . D-7
Set $IBPATH for OpenFabrics Scripts . . . . . . . . . . . . . . . . . . . . . . . . D-7
SDP Module Not Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-7
ibsrpdm Command Hangs when Two Host Channel Adapters are Installed but Only Unit 1 is Connected to the Switch . . . . . . D-8
Outdated ipath_ether Configuration Setup Generates Error . . . . . . . . D-8
System Administration Troubleshooting. . . . . . . . . . . . . . . . . . . . . . . . . . . . D-8
Broken Intermediate Link. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-9
Performance Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-9
Large Message Receive Side Bandwidth Varies with Socket Affinity on Opteron Systems . . . . . . D-9
Erratic Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-10
Method 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-10
Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-10
Immediately change the processor affinity of an IRQ. . . . . . . . . D-11
Performance Warning if ib_qib Shares Interrupts with eth0 . . . . . D-12
Open MPI Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-12
Invalid Configuration Warning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-12
E ULP Troubleshooting
Troubleshooting VirtualNIC and VIO Hardware Issues . . . . . . . . . . . . . . . . E-1
Checking the logical connection between the IB Host and the VIO hardware . . . . . . E-1
Verify that the proper VirtualNIC driver is running . . . . . . . . . . . E-2
Verifying that the qlgc_vnic.cfg file contains the correct information . . . . . . E-2
Verifying that the host can communicate with the I/O Controllers (IOCs) of the VIO hardware . . . . . . E-3
Checking the interface definitions on the host. . . . . . . . . . . . . . . . . . . E-6
Interface does not show up in output of 'ifconfig' . . . . . . . . . . . . E-6
Verify the physical connection between the VIO hardware and the Ethernet network . . . . . . E-7
Troubleshooting SRP Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-9
ib_qlgc_srp_stats showing session in disconnected state . . . . . E-9
Session in 'Connection Rejected' state . . . . . . . . . . . . . . . . . . . . . . . . E-11
Attempts to read or write to disk are unsuccessful . . . . . . . . . . . . . . . E-14
Four sessions in a round-robin configuration are active . . . . . . . . . . . E-15
Which port does a port GUID refer to? . . . . . . . . . . . . . . . . . . . . . . . . E-16
How does the user find a HCA port GUID?. . . . . . . . . . . . . . . . . . . . . E-17
Need to determine the SRP driver version.. . . . . . . . . . . . . . . . . . . . . E-19
F Write Combining
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-1
PAT and Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-1
MTRR Mapping and Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-2
Edit BIOS Settings to Fix MTRR Issues . . . . . . . . . . . . . . . . . . . . . . . F-2
Use the ipath_mtrr Script to Fix MTRR Issues. . . . . . . . . . . . . . . . F-2
Verify Write Combining is Working . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-3
G Commands and Files
Check Cluster Homogeneity with ipath_checkout . . . . . . . G-1
Restarting InfiniPath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-2
Summary and Descriptions of Commands. . . . . . . . . . . . . . . . . . . . . . . . . . G-2
dmesg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-4
iba_opp_query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-4
iba_hca_rev. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-9
iba_manage_switch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-19
iba_packet_capture. . . . . . . . . . . . . . . . G-21
ibhosts . . . . . . . . . . . . . . . . . . . . . G-22
ibstatus. . . . . . . . . . . . . . . . . . . . . G-22
ibtracert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-23
ibv_devinfo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-24
ident . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-24
ipath_checkout. . . . . . . . . . . . . . . . . . G-25
ipath_control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-27
ipath_mtrr. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-28
ipath_pkt_test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-29
ipathstats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-30
lsmod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-30
modprobe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-30
mpirun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-31
mpi_stress. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-31
rpm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-32
strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-32
Common Tasks and Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-32
Summary and Descriptions of Useful Files . . . . . . . . . . . . . . . . . . . . . . . . . G-34
boardversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-34
status_str. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-35
version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-36
Summary of Configuration Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-36
H Recommended Reading
References for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-1
Books for Learning MPI Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-1
Reference and Source for SLURM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-1
InfiniBand® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-1
OpenFabrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2
Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2
Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2
Rocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2
Other Software Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2

List of Figures

3-1 QLogic OFED+ Software Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3-2 Distributed SA Default Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
3-3 Distributed SA Multiple Virtual Fabrics Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
3-4 Distributed SA Multiple Virtual Fabrics Configured Example . . . . . . . . . . . . . . . . . . 3-15
3-5 Virtual Fabrics with Overlapping Definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
3-6 Virtual Fabrics with PSM_MPI Virtual Fabric Enabled . . . . . . . . . . . . . . . . . . . . . . . 3-16
3-7 Virtual Fabrics with all SIDs assigned to PSM_MPI Virtual Fabric. . . . . . . . . . . . . . 3-16
3-8 Virtual Fabrics with Unique Numeric Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17

List of Tables

3-1 ibmtu Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3-2 krcvqs Parameter Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-27
3-3 Checks Performed by ipath_perf_tuning Tool . . . . . . . . . . . . . . . . . . . . . . . 3-34
3-4 ipath_perf_tuning Tool Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
3-5 Test Execution Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36
4-1 Open MPI Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4-2 Command Line Options for Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
4-3 Intel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4-4 Portland Group (PGI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4-5 Available Hardware and Software Contexts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4-6 Environment Variables Relevant for any PSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
4-7 Environment Variables Relevant for Open MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
5-1 Other Supported MPI Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5-2 MVAPICH Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
5-3 MVAPICH2 Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
5-4 Platform MPI 8 Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
5-5 Intel MPI Wrapper Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
6-1 SHMEM Run Time Library Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
6-2 shmemrun Environment Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6-3 SHMEM Application Programming Interface Calls. . . . . . . . . . . . . . . . . . . . . . . . . . 6-18
6-4 QLogic SHMEM micro-benchmarks options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-27
6-5 QLogic SHMEM random access benchmark options. . . . . . . . . . . . . . . . . . . . . . . . 6-28
6-6 QLogic SHMEM all-to-all benchmark options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-29
6-7 QLogic SHMEM barrier benchmark options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-30
6-8 QLogic SHMEM reduce benchmark options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-31
D-1 LED Link and Data Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-1
G-1 Useful Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-2
G-2 ipath_checkout Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-26
G-3 Common Tasks and Commands Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-33
G-4 Useful Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-34
G-5 status_str File Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-35
G-6 Status—Other Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-36
G-7 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-37

Preface

The QLogic OFED+ Host Software User Guide shows end users how to use the installed software to set up the fabric. End users include both the cluster administrator and the Message-Passing Interface (MPI) application programmers, who have different but overlapping interests in the details of the technology.
For specific instructions about installing the QLogic QLE7340, QLE7342, QMH7342, and QME7342 PCI Express® (PCIe®) adapters, see the QLogic InfiniBand® Adapter Hardware Installation Guide; for the initial installation of the InfiniBand® Fabric Software, see the QLogic InfiniBand® Fabric Software Installation Guide.

Intended Audience

This guide is intended for end users responsible for administration of a cluster network as well as for end users who want to use that cluster.
This guide assumes that all users are familiar with cluster computing, that the cluster administrator is familiar with Linux administration, and that the application programmer is familiar with MPI, vFabrics, SRP, and Distributed SA.

Related Materials

QLogic InfiniBand® Adapter Hardware Installation Guide
QLogic InfiniBand® Fabric Software Installation Guide
Release Notes

Documentation Conventions

This guide uses the following documentation conventions:

NOTE: provides additional information.

CAUTION! indicates the presence of a hazard that has the potential of causing damage to data or equipment.

WARNING!! indicates the presence of a hazard that has the potential of causing personal injury.

Text in blue font indicates a hyperlink (jump) to a figure, table, or section in this guide, and links to Web sites are shown in underlined blue. For example:
Table 9-2 lists problems related to the user interface and remote agent.
See "Installation Checklist" on page 3-6.
For more information, visit www.qlogic.com.

Text in bold font indicates user interface elements such as menu items, buttons, check boxes, or column headings. For example:
Click the Start button, point to Programs, point to Accessories, and then click Command Prompt.
Under Notification Options, select the Warning Alarms check box.

Text in Courier font indicates a file name, directory path, or command line text. For example:
To return to the root directory from anywhere in the file structure, type cd /root and press ENTER.
Enter the following command: sh ./install.bin

Key names and key strokes are indicated with UPPERCASE:
Press CTRL+P.
Press the UP ARROW key.

Text in italics indicates terms, emphasis, variables, or document titles. For example:
For a complete listing of license agreements, refer to the QLogic Software End User License Agreement.
What are shortcut keys?
To enter the date type mm/dd/yyyy (where mm is the month, dd is the day, and yyyy is the year).

Topic titles between quotation marks identify related topics either within this manual or in the online help, which is also referred to as the help system throughout this document.

License Agreements

Refer to the QLogic Software End User License Agreement for a complete listing of all license agreements affecting this product.

Technical Support

Customers should contact their authorized maintenance provider for technical support of their QLogic products. QLogic-direct customers may contact QLogic Technical Support; others will be redirected to their authorized maintenance provider. Visit the QLogic support Web site listed in Contact Information for the latest firmware and software updates.
For details about available service plans, or for information about renewing and extending your service, visit the Service Program web page at
http://www.qlogic.com/services

Training

QLogic offers training for technical professionals for all iSCSI, InfiniBand® (IB), and Fibre Channel products. From the main QLogic web page at www.qlogic.com, click the Support tab at the top, and then click Training and Certification on the left. The QLogic Global Training portal offers online courses, certification exams, and scheduling of in-person training.
Technical Certification courses include installation, maintenance and troubleshooting QLogic products. Upon demonstrating knowledge using live equipment, QLogic awards a certificate identifying the student as a certified professional. You can reach the training professionals at QLogic by e-mail at training@qlogic.com.

Contact Information

QLogic Technical Support for products under warranty is available during local standard working hours excluding QLogic Observed Holidays. For customers with extended service, consult your plan for available hours. For Support phone numbers, see the Contact Support link at http://support.qlogic.com.

Support Headquarters: QLogic Corporation, 4601 Dean Lakes Blvd., Shakopee, MN 55379 USA
QLogic Web Site: www.qlogic.com
Technical Support Web Site: http://support.qlogic.com
Technical Support E-mail: support@qlogic.com
Technical Training E-mail: training@qlogic.com

Knowledge Database

The QLogic knowledge database is an extensive collection of QLogic product information that you can search for specific solutions. We are constantly adding to the collection of information in our database to provide answers to your most urgent questions. Access the database from the QLogic Support Center:
http://support.qlogic.com.

1 Introduction

How this Guide is Organized

The QLogic OFED+ Host Software User Guide is organized into these sections:

Section 1, provides an overview and describes interoperability.

Section 2, describes how to set up your cluster to run high-performance MPI jobs.

Section 3, describes the lower levels of the supplied QLogic OFED+ Host software. This section is of interest to an InfiniBand® cluster administrator.

Section 4, helps the Message Passing Interface (MPI) programmer make the best use of the Open MPI implementation. Examples are provided for compiling and running MPI programs.

Section 5, gives examples for compiling and running MPI programs with other MPI implementations.

Section 7, describes QLogic Performance Scaled Messaging (PSM), which provides support for full Virtual Fabric (vFabric) integration, allowing users to specify InfiniBand® Service Level (SL) and Partition Key (PKey), or to provide a configured Service ID (SID) to target a vFabric.

Section 8, describes dispersive routing in the InfiniBand® fabric to avoid congestion hotspots by "spraying" messages across the multiple potential paths.

Section 9, describes open-source Preboot Execution Environment (gPXE) boot, including installation and setup.

Appendix A, describes how to run QLogic's performance measurement programs.

Appendix B, describes SCSI RDMA Protocol (SRP) configuration that allows the SCSI protocol to run over InfiniBand® for Storage Area Network (SAN) usage.

Appendix C, describes two methods the administrator can use to allow users to submit MPI jobs through batch queuing systems.

Appendix D, provides information for troubleshooting installation, cluster administration, and MPI.

Appendix E, provides information for troubleshooting the upper layer protocol utilities in the fabric.

Appendix F, provides instructions for checking write combining and for using the Page Attribute Table (PAT) and Memory Type Range Registers (MTRR).

Appendix G, contains useful programs and files for debugging, as well as commands for common tasks.

Appendix H, contains a list of useful web sites and documents for a further understanding of the InfiniBand® fabric, and related information.

In addition, the QLogic InfiniBand® Adapter Hardware Installation Guide contains information on QLogic hardware installation, and the QLogic InfiniBand® Fabric Software Installation Guide contains information on QLogic software installation.

Overview

The material in this documentation pertains to a QLogic OFED+ cluster. A cluster is defined as a collection of nodes, each attached to an InfiniBand®-based fabric through the QLogic interconnect.

The QLogic IB Host Channel Adapters (HCA) are InfiniBand® 4X adapters. The quad data rate (QDR) adapters (QLE7340, QLE7342, QMH7342, and QME7342) have a raw data rate of 40Gbps (data rate of 32Gbps). The QLE7340, QLE7342, QMH7342, and QME7342 adapters can also run in DDR or SDR mode.

The QLogic IB HCAs utilize standard, off-the-shelf InfiniBand® 4X switches and cabling. The QLogic interconnect is designed to work with all InfiniBand®-compliant switches.

NOTE: If you are using the QLE7300 series adapters in QDR mode, a QDR switch must be used.

QLogic OFED+ software is interoperable with other vendors' IBTA-compliant InfiniBand® adapters running compatible OFED releases. There are several options for subnet management in your cluster:

An embedded subnet manager can be used in one or more managed switches. QLogic offers the QLogic Embedded Fabric Manager (FM) for both DDR and QDR switch product lines supplied by your IB switch vendor.

A host-based subnet manager can be used. QLogic provides the QLogic Fabric Manager (FM), as a part of the QLogic InfiniBand® Fabric Suite (IFS).

Interoperability

QLogic OFED+ participates in the standard IB subnet management protocols for configuration and monitoring. Note that:

QLogic OFED+ (including Internet Protocol over InfiniBand® (IPoIB)) is interoperable with other vendors' InfiniBand® adapters running compatible OFED releases.

In addition to supporting running MPI over verbs, QLogic provides a high-performance InfiniBand®-compliant vendor-specific protocol, known as PSM. MPIs run over PSM will not interoperate with other adapters.

NOTE: See the OpenFabrics web site at www.openfabrics.org for more information on the OpenFabrics Alliance.
2 Step-by-Step Cluster Setup and MPI Usage Checklists

This section describes how to set up your cluster to run high-performance Message Passing Interface (MPI) jobs.

Cluster Setup

Perform the following tasks when setting up the cluster. These include BIOS, adapter, and system settings.
1. Make sure that hardware installation has been completed according to the instructions in the QLogic InfiniBand® Adapter Hardware Installation Guide, and software installation and driver configuration has been completed according to the instructions in the QLogic InfiniBand® Fabric Software Installation Guide. To minimize management problems, the compute nodes of the cluster must have very similar hardware configurations and identical software installations. See "Homogeneous Nodes" on page 3-37 for more information.
2. Check that the BIOS is set properly according to the instructions in the QLogic InfiniBand® Adapter Hardware Installation Guide.
3. Set up the Distributed Subnet Administration (SA) to correctly synchronize your virtual fabrics. See "QLogic Distributed Subnet Administration" on page 3-12.
4. Adjust settings, including setting the appropriate MTU size. See "Adapter and Other Settings" on page 3-38.
5. Remove unneeded services. See "Remove Unneeded Services" on page 3-39.
6. Disable powersaving features. See "Host Environment Setup for MPI" on page 3-40.
7. Check other performance tuning settings. See "Performance Settings and Management Tips" on page 3-24.
8. Set up the host environment to use ssh. Two methods are discussed in "Host Environment Setup for MPI" on page 3-40.
9. Verify the cluster setup. See "Checking Cluster and Software Status" on page 3-44.

Using MPI

1. Verify that the QLogic hardware and software has been installed on all the nodes you will be using, and that ssh is set up on your cluster (see all the steps in the Cluster Setup checklist).
2. Set up Open MPI. See "Setup" on page 4-2.
3. Compile Open MPI applications. See "Compiling Open MPI Applications" on page 4-2.
4. Create an mpihosts file that lists the nodes where your programs will run. See "Create the mpihosts File" on page 4-3. (A minimal example appears at the end of this checklist.)
5. Run Open MPI applications. See "Running Open MPI Applications" on page 4-3.
6. Configure MPI programs for Open MPI. See "Configuring MPI Programs for Open MPI" on page 4-5.
7. To test using other MPIs that run over PSM, such as MVAPICH, MVAPICH2, Platform MPI, and Intel MPI, see Section 5 Using Other MPIs.
8. To switch between multiple versions of MVAPICH, use the mpi-selector. See "Managing MVAPICH and MVAPICH2 with the mpi-selector Utility" on page 5-5.
9. Refer to “Performance Tuning” on page 3-25 to read more about runtime performance tuning.
10. Refer to Section 5 Using Other MPIs to learn about using other MPI implementations.
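As an illustration of steps 4 and 5 above, the following is a minimal sketch; the host names, process count, and program name are hypothetical, and the options shown follow standard Open MPI usage (see Section 4 for the QLogic-specific details):

# mpihosts: one host name per line (hypothetical node names)
node01
node02
node03
node04

# Launch eight ranks of an MPI program across those nodes
$ mpirun -np 8 -hostfile mpihosts ./your_mpi_app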
3 InfiniBand® Cluster Setup and Administration

This section describes what the cluster administrator needs to know about the QLogic OFED+ software and system administration.

Introduction

The IB driver ib_qib, QLogic Performance Scaled Messaging (PSM), accelerated Message-Passing Interface (MPI) stack, the protocol and MPI support libraries, and other modules are components of the QLogic OFED+ software. This software provides the foundation that supports the MPI implementation.

Figure 3-1 illustrates these relationships. Note that HP-MPI, Platform MPI, Intel MPI, MVAPICH, MVAPICH2, and Open MPI can run either over PSM or OpenFabrics® User Verbs.

[Figure 3-1. QLogic OFED+ Software Structure: user-space MPI implementations (Open MPI, MVAPICH, MVAPICH2, Platform MPI, Intel MPI) running over the QLogic OFED+ communication library (PSM) or over InfiniBand®/OpenFabrics User Verbs and uDAPL, with TCP/IP over IPoIB, the uMAD API used by the QLogic FM, the kernel-space QLogic OFED+ driver ib_qib and SRP, and the QLogic IB adapter hardware below.]

Installed Layout

This section describes the default installed layout for the QLogic OFED+ software and QLogic-supplied MPIs.
QLogic-supplied Open MPI, MVAPICH, and MVAPICH2 RPMs with PSM support and compiled with GCC, PGI, and the Intel compilers are installed in directories using the following format:
/usr/mpi/<compiler>/<mpi>-<mpi_version>-qlc
For example:
/usr/mpi/gcc/openmpi-1.4-qlc
QLogic OFED+ utility programs are installed in:
/usr/bin
/sbin
/opt/iba/*
Documentation is found in:
/usr/share/man
/usr/share/doc/infinipath
License information is found only in /usr/share/doc/infinipath. QLogic OFED+ Host Software user documentation can be found on the QLogic web site on the software download page for your distribution.
Configuration files are found in:
/etc/sysconfig
Init scripts are found in:
/etc/init.d
The IB driver modules in this release are installed in:
/lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/qib
Most of the other OFED modules are installed under the infiniband subdirectory. Other modules are installed under:
/lib/modules/$(uname -r)/updates/kernel/drivers/net
The RDS modules are installed under:
/lib/modules/$(uname -r)/updates/kernel/net/rds
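As a quick sanity check of this layout on an installed node, the following sketch lists the per-compiler MPI directories and confirms the driver module location; the exact directory names vary with the compilers and MPI versions that were installed:

# List QLogic-supplied MPI builds (one subdirectory per MPI version and compiler)
$ ls /usr/mpi/gcc
# Confirm the ib_qib driver module is present for the running kernel
$ ls /lib/modules/$(uname -r)/updates/kernel/drivers/infiniband/hw/qib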

IB and OpenFabrics Driver Overview
The ib_qib module provides low-level QLogic hardware support, and is the base driver for both MPI/PSM programs and general OpenFabrics protocols such as IPoIB and sockets direct protocol (SDP). The driver also supplies the Subnet Management Agent (SMA) component.
The following is a list of the optional configurable OpenFabrics components and their default settings:
IPoIB network interface. This component is required for TCP/IP networking
for running IP traffic over the IB link. It is not running until it is configured.
OpenSM. This component is disabled at startup. QLogic recommends using
the QLogic Fabric Manager (FM), which is included with the IFS or optionally available within the QLogic switches. QLogic FM or OpenSM can be installed on one or more nodes with only one node being the master SM.
SRP (OFED and QLogic modules). SRP is not running until the module is
loaded and the SRP devices on the fabric have been discovered.
MPI over uDAPL (can be used by Intel MPI). IPoIB must be configured
before MPI over uDAPL can be set up.
Other optional drivers can now be configured and enabled, as described in "IPoIB Network Interface Configuration" on page 3-3.
Complete information about starting, stopping, and restarting the QLogic OFED+ services is in "Managing the ib_qib Driver" on page 3-21.
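To see which of these components are currently loaded on a node, one hedged approach is to scan the kernel module list; the module names below are the usual OFED names and are shown for illustration only:

# Base ib_qib driver plus optional ULP modules (only loaded modules are listed)
$ lsmod | egrep 'ib_qib|ib_ipoib|ib_srp|ib_sdp|rds'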

IPoIB Network Interface Configuration

The following instructions show you how to manually configure your OpenFabrics IPoIB network interface. QLogic recommends using the QLogic OFED+ Host Software Installation package or the iba_config tool. For larger clusters, FastFabric can be used to automate installation and configuration of many nodes. These tools automate the configuration of the IPoIB network interface. This example assumes that you are using sh or bash as your shell, all required QLogic OFED+ and OpenFabrics RPMs are installed, and your startup scripts have been run (either manually or at system boot).
For this example, the IPoIB network is 10.1.17.0 (one of the networks reserved for private use, and thus not routable on the Internet), with a /8 host portion. In this case, the netmask must be specified.
NOTE: This example assumes that no hosts files exist, the host being configured has the IP address 10.1.17.3, and DHCP is not used. Instructions are only for this static IP address case. Configuration methods for using DHCP will be supplied in a later release.
1. Type the following command (as a root user):
ifconfig ib0 10.1.17.3 netmask 0xffffff00
2. To verify the configuration, type:
ifconfig ib0
ifconfig ib1
The output from this command will be similar to:
ib0 Link encap:InfiniBand HWaddr
00:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00: 00
inet addr:10.1.17.3 Bcast:10.1.17.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:4096 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
3. Type:
ping -c 2 -b 10.1.17.255
The output of the ping command will be similar to the following, with a line for each host already configured and connected:
WARNING: pinging broadcast address
PING 10.1.17.255 (10.1.17.255) 517(84) bytes of data.
174 bytes from 10.1.17.3: icmp_seq=0 ttl=174 time=0.022 ms
64 bytes from 10.1.17.1: icmp_seq=0 ttl=64 time=0.070 ms (DUP!)
64 bytes from 10.1.17.7: icmp_seq=0 ttl=64 time=0.073 ms (DUP!)
The IPoIB network interface is now configured.
4. Restart (as a root user) by typing:
/etc/init.d/openibd restart
NOTE: The configuration must be repeated each time the system is rebooted. IPoIB-CM (Connected Mode) is enabled by default. The setting in /etc/infiniband/openib.conf is SET_IPOIB_CM=yes. To use datagram mode, change the setting to SET_IPOIB_CM=no. The setting can also be changed when asked during initial installation (./INSTALL).
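For reference, a minimal excerpt of the relevant setting in /etc/infiniband/openib.conf is shown below; all other settings in the file are omitted here:

# /etc/infiniband/openib.conf (excerpt)
# Connected mode (the default):
SET_IPOIB_CM=yes
# For datagram mode, use the following value instead:
# SET_IPOIB_CM=no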

IPoIB Administration

Administering IPoIB

Stopping, Starting and Restarting the IPoIB Driver
QLogic recommends using the QLogic IFS Installer TUI or the iba_config command to enable autostart for the IPoIB driver. Refer to the QLogic InfiniBand® Fabric Software Installation Guide for more information. To stop, start, or restart the IPoIB driver from the command line, use the following commands.
To stop the IPoIB driver, use the following command:
/etc/init.d/openibd stop
To start the IPoIB driver, use the following command:
/etc/init.d/openibd start
To restart the IPoIB driver, use the following command:
/etc/init.d/openibd restart

Configuring IPoIB

QLogic recommends using the QLogic IFS Installer TUI, FastFabric, or the iba_config command to configure the boot time and autostart of the IPoIB driver. Refer to the QLogic InfiniBand® Fabric Software Installation Guide for more information on using the QLogic IFS Installer TUI. Refer to the QLogic FastFabric User Guide for more information on using FastFabric. To configure the IPoIB driver from the command line, use the following commands.
Editing the IPoIB Configuration File
1. For each IP Link Layer interface, create an interface configuration file, /etc/sysconfig/network/ifcfg-NAME, where NAME is the value of the NAME field specified in the CREATE block. The following is an example of the ifcfg-NAME file:
DEVICE=ib1
BOOTPROTO=static
BROADCAST=192.168.18.255
IPADDR=192.168.18.120
NETMASK=255.255.255.0
ONBOOT=yes
NM_CONTROLLED=no
NOTE: For IPoIB, the INSTALL script for the adapter now helps the user create the ifcfg files.
2. After modifying the /etc/sysconfig/ipoib.cfg file, restart the IPoIB driver with the following:
/etc/init.d/openibd restart
IB Bonding
IB bonding is a high availability solution for IPoIB interfaces. It is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. The support for IPoIB interfaces is only for the active-backup mode; other modes should not be used. QLogic supports bonding across HCA ports and bonding port 1 and port 2 on the same HCA.

Interface Configuration Scripts

Create interface configuration scripts for the ibX and bondX interfaces. Once the configurations are in place, perform a server reboot, or a service network restart. For SLES operating systems (OS), a server reboot is required. Refer to the following standard syntax for bonding configuration by the OS.
NOTE
For all of the following OS configuration script examples that set MTU, MTU=65520 is valid only if all IPoIB slaves operate in connected mode and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If the MTU is not set correctly or the MTU is not set at all (set to the default value), performance of the interface may be lower.
Red Hat EL5 and EL6
The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
NETWORK=192.168.1.0
BROADCAST=192.168.1.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MTU=65520
BONDING_OPTS="primary=ib0 updelay=0 downdelay=0"
The following is an example for ib0 (slave). The file is named /etc/sysconfig/network-scripts/ifcfg-ib0:
DEVICE=ib0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand
PRIMARY=yes
The following is an example for ib1 (slave 2). The file is named
/etc/sysconfig/network-scripts/ifcfg-ib1:
DEVICE=ib1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
TYPE=InfiniBand
Add the following lines to the RHEL 5.x file /etc/modprobe.conf, or the RHEL 6.x file /etc/modprobe.d/ib_qib.conf:
alias bond0 bonding
options bond0 miimon=100 mode=1 max_bonds=1
SuSE Linux Enterprise Server (SLES) 10 and 11
The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE="bond0"
TYPE="Bonding"
IPADDR="192.168.1.1"
NETMASK="255.255.255.0"
NETWORK="192.168.1.0"
BROADCAST="192.168.1.255"
BOOTPROTO="static"
USERCTL="no"
STARTMODE="onboot"
BONDING_MASTER="yes"
BONDING_MODULE_OPTS="mode=active-backup miimon=100 primary=ib0 updelay=0 downdelay=0"
BONDING_SLAVE0=ib0
BONDING_SLAVE1=ib1
MTU=65520
The following is an example for ib0 (slave). The file is named /etc/sysconfig/network-scripts/ifcfg-ib0:
DEVICE='ib0'
BOOTPROTO='none'
STARTMODE='off'
WIRELESS='no'
ETHTOOL_OPTIONS=''
NAME=''
USERCONTROL='no'
IPOIB_MODE='connected'
The following is an example for ib1 (slave 2). The file is named /etc/sysconfig/network-scripts/ifcfg-ib1:
DEVICE='ib1'
BOOTPROTO='none'
STARTMODE='off'
WIRELESS='no'
ETHTOOL_OPTIONS=''
NAME=''
USERCONTROL='no'
IPOIB_MODE='connected'
Verify the following line is set to the value of yes in /etc/sysconfig/boot:
RUN_PARALLEL="yes"

Verify IB Bonding is Configured

After the configuration scripts are updated, and the service network is restarted or a server reboot is accomplished, use the following CLI commands to verify that IB bonding is configured:
# cat /proc/net/bonding/bond0
# ifconfig
Example of cat /proc/net/bonding/bond0 output:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.2.3 (December 6, 2007)
Bonding Mode: fault-tolerance (active-backup) (fail_over_mac)
Primary Slave: ib0
Currently Active Slave: ib0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: ib0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 80:00:04:04:fe:80
Slave Interface: ib1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 80:00:04:05:fe:80
Example of ifconfig output:
st2169:/etc/sysconfig # ifconfig
bond0 Link encap:InfiniBand HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::211:7500:ff:909b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:65520 Metric:1
RX packets:120619276 errors:0 dropped:0 overruns:0 frame:0
TX packets:120619277 errors:0 dropped:137 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:10132014352 (9662.6 Mb) TX bytes:10614493096 (10122.7 Mb)
ib0 Link encap:InfiniBand HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1
RX packets:118938033 errors:0 dropped:0 overruns:0 frame:0
TX packets:118938027 errors:0 dropped:41 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:9990790704 (9527.9 Mb) TX bytes:10466543096 (9981.6 Mb)
ib1 Link encap:InfiniBand HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1
RX packets:1681243 errors:0 dropped:0 overruns:0 frame:0
TX packets:1681250 errors:0 dropped:96 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:141223648 (134.6 Mb) TX bytes:147950000 (141.0 Mb)

Subnet Manager Configuration

QLogic recommends using the QLogic Fabric Manager to manage your fabric. Refer to the QLogic Fabric Manager User Guide for information on configuring the QLogic Fabric Manager.
OpenSM is a component of the OpenFabrics project that provides a Subnet Manager (SM) for IB networks. This package can optionally be installed on any machine, but it only needs to be enabled on the machine in the cluster that will act as a subnet manager. You cannot use OpenSM if any of your IB switches provide a subnet manager, or if you are running a host-based SM, for example the QLogic Fabric Manager.

WARNING
Do not run OpenSM and the QLogic Fabric Manager in the same fabric.

If you are using the Installer tool, you can set the OpenSM default behavior at the time of installation.
OpenSM only needs to be enabled on the node that acts as the subnet manager. To enable OpenSM, use either the iba_config command or the chkconfig command (as a root user) on the node where it will be run. The chkconfig command to enable OpenSM is:
chkconfig opensmd on
The chkconfig command to disable it on reboot is:
chkconfig opensmd off
You can start opensmd without rebooting your machine by typing:
/etc/init.d/opensmd start
You can stop opensmd by typing:
/etc/init.d/opensmd stop
If you want to pass any arguments to the OpenSM program, modify the file /etc/init.d/opensmd and add the arguments to the OPTIONS variable. For example, to use the UPDN algorithm instead of the Min Hop algorithm:
OPTIONS="-R updn"
For more information on OpenSM, see the OpenSM man pages, or look on the OpenFabrics web site.

QLogic Distributed Subnet Administration

As InfiniBand® clusters are scaled into the Petaflop range and beyond, a more efficient method for handling queries to the Fabric Manager is required. One of the issues is that while the Fabric Manager can configure and operate that many nodes, under certain conditions it can become overloaded with queries from those same nodes.
For example, consider an IB fabric consisting of 1,000 nodes, each with 4 processors. When a large MPI job is started across the entire fabric, each process needs to collect IB path records for every other node in the fabric - and every single process is going to be querying the subnet manager for these path records at roughly the same time. This amounts to a total of 3.9 million path queries just to start the job.
In the past, MPI implementations have side-stepped this problem by hand crafting path records themselves, but this solution cannot be used if advanced fabric management techniques such as virtual fabrics and mesh/torus configurations are being used. In such cases, only the subnet manager itself has enough information to correctly build a path record between two nodes.
The Distributed Subnet Administration (SA) solves this problem by allowing each node to locally replicate the path records needed to reach the other nodes on the fabric. At boot time, each Distributed SA queries the subnet manager for information about the relevant parts of the fabric, backing off whenever the subnet manager indicates that it is busy. Once this information is in the Distributed SA's database, it is ready to answer local path queries from MPI or other IB applications. If the fabric changes (due to a switch failure or a node being added or removed from the fabric) the Distributed SA updates the affected portions of the database. The Distributed SA can be installed and run on any node in the fabric. It is only needed on nodes running MPI applications.

Applications that use Distributed SA

The QLogic PSM Library has been extended to take advantage of the Distributed SA. Therefore, all MPIs that use the QLogic PSM library can take advantage of the Distributed SA. Other applications must be modified specifically to take advantage of it. For developers writing applications that use the Distributed SA, refer to the header file /usr/include/Infiniband/ofedplus_path.h for information on using Distributed SA APIs. This file can be found on any node where the Distributed SA is installed. For further assistance please contact QLogic Support.

Virtual Fabrics and the Distributed SA

The IBTA standard states that applications can be identified by a Service ID (SID). The QLogic Fabric Manager uses SIDs to identify applications. One or more applications can be associated with a Virtual Fabric using the SID. The Distributed SA is designed to be aware of Virtual Fabrics, but to only store records for those Virtual Fabrics that match the SIDs in the Distributed SA's configuration file. The Distributed SA recognizes when multiple SIDs match the same Virtual Fabric and will only store one copy of each path record within a Virtual Fabric. SIDs that match more than one Virtual Fabric will be associated with a single Virtual Fabric. The Virtual Fabrics that do not match SIDs in the Distributed SA's database will be ignored.

Configuring the Distributed SA

To minimize the number of queries made by the Distributed SA, it is important to configure it correctly, both to match the configuration of the Fabric Manager and to exclude those portions of the fabric that will not be used by applications using the Distributed SA. The configuration file for the Distributed SA is named /etc/sysconfig/iba/qlogic_sa.conf.

Default Configuration

As shipped, the QLogic Fabric Manager creates a single virtual fabric, called “Default”, and maps all nodes and Service IDs to it. The Distributed SA ships with a configuration that lists a set of thirty-one SIDs: 0x1000117500000000 through 0x100011750000000f, and 0x1 through 0xf. This results in an arrangement like the one shown in Figure 3-2.
Figure 3-2. Distributed SA Default Configuration
If you are using the QLogic Fabric Manager in its default configuration, and you are using the standard QLogic PSM SIDs, this arrangement will work fine and you will not need to modify the Distributed SA's configuration file. Notice, however, that the Distributed SA has restricted the range of SIDs it cares about to those that were defined in its configuration file. Attempts to get path records using other SIDs will not work, even if those other SIDs are valid for the fabric. When using this default configuration, MPI applications must only be run using one of these SIDs.

Multiple Virtual Fabrics Example

A person configuring the physical IB fabric may want to limit how much IB bandwidth MPI applications are permitted to consume. In that case, they may re-configure the QLogic Fabric Manager, turning off the “Default” Virtual Fabric and replacing it with several other Virtual Fabrics.
In Figure 3-3, the administrator has divided the physical fabric into four virtual fabrics: “Admin” (used to communicate with the Fabric Manager), “Storage” (used by SRP), “PSM_MPI” (used by regular MPI jobs) and a special “Reserved” fabric for special high-priority jobs.
Figure 3-3. Distributed SA Multiple Virtual Fabrics Example
Because the Distributed SA was not configured to include the SID Range 0x10 through 0x1f, it has simply ignored the “Reserved” VF. Adding those SIDs to the qlogic_sa.conf file solves the problem, as shown in Figure 3-4.
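For example, the following is a sketch of the entries that would be added to /etc/sysconfig/iba/qlogic_sa.conf for the “Reserved” SID Range 0x10 through 0x1f used in this example (one SID= line per Service ID; use the values actually configured in your Fabric Manager):
SID=0x10
SID=0x11
SID=0x12
SID=0x13
SID=0x14
SID=0x15
SID=0x16
SID=0x17
SID=0x18
SID=0x19
SID=0x1a
SID=0x1b
SID=0x1c
SID=0x1d
SID=0x1e
SID=0x1f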
Figure 3-4. Distributed SA Multiple Virtual Fabrics Configured Example

Virtual Fabrics with Overlapping Definitions

As defined, SIDs should never be shared between Virtual Fabrics. Unfortunately, it is very easy to accidentally create such overlaps. Figure 3-5 shows an example with overlapping definitions.
Figure 3-5. Virtual Fabrics with Overlapping Definitions
In Figure 3-5, the fabric administrator enabled the “PSM_MPI” Virtual Fabric without modifying the “Default” Virtual Fabric. As a result, the Distributed SA sees two different virtual fabrics that match its configuration file.
In Figure 3-6, the person administering the fabric has created two different Virtual Fabrics without turning off the Default - and two of the new fabrics have overlapping SID ranges.
Figure 3-6. Virtual Fabrics with PSM_MPI Virtual Fabric Enabled
In Figure 3-6, the administrator enabled the “PSM_MPI” fabric, and then added a new “Reserved” fabric that uses one of the SID ranges that “PSM_MPI” uses. When a path query has been received, the Distributed SA deals with these conflicts as follows:
First, any virtual fabric with a pkey of 0xffff or 0x7fff is considered to be an Admin or Default virtual fabric. This Admin or Default virtual fabric is treated as a special case by the Distributed SA and is used only as a last resort. Stored SIDs are only mapped to the default virtual fabric if they do not match any other Virtual Fabrics. Thus, in the first example, Figure 3-6, the Distributed SA will assign all the SIDs in its configuration file to the “PSM_MPI” Virtual Fabric as shown in Figure 3-7.
Figure 3-7. Virtual Fabrics with all SIDs assigned to PSM_MPI Virtual Fabric
Second, the Distributed SA handles overlaps by taking advantage of the fact that Virtual Fabrics have unique numeric indexes. These indexes are assigned by the QLogic Fabric Manager in the order in which the Virtual Fabrics appear in the configuration file. The indexes can be seen by using the iba_saquery -o vfinfo command. The Distributed SA will always assign a SID to the Virtual Fabric with the lowest index, as shown in Figure 3-8. This ensures that all copies of the Distributed SA in the IB fabric will make the same decisions about assigning SIDs. However, it also means that the behavior of your fabric can be affected by the order in which you configured the virtual fabrics.
Figure 3-8. Virtual Fabrics with Unique Numeric Indexes
In Figure 3-8, the Distributed SA assigns all overlapping SIDs to the “PSM_MPI” fabric because it has the lowest index.

NOTE
The Distributed SA makes these assignments not because they are right, but because they allow the fabric to work even though there are configuration ambiguities. The correct solution in these cases is to redefine the fabric so that no node will ever be a member of two Virtual Fabrics that service the same SID.

Distributed SA Configuration File

The Distributed SA configuration file is /etc/sysconfig/iba/qlogic_sa.conf. It has several settings, but normally administrators will only need to deal with two or three of them.
SID
The SID is the primary configuration setting for the Distributed SA, and it can be specified multiple times. The SIDs identify applications which will use the distributed SA to determine their path records. The default configuration for the Distributed SA includes all the SIDs defined in the default Qlogic Fabric Manager configuration for use by MPI.
Each SID= entry defines one Service ID that will be used to identify an application. In addition, multiple SID= entries can be specified. For example, a virtual fabric has three sets of SIDs associated with it: 0x0a1 through 0x0a3, 0x1a1 through 0x1a3 and 0x2a1 through 0x2a3. You would define this as:
SID=0x0a1
SID=0x0a2
SID=0x0a3
SID=0x1a1
SID=0x1a2
SID=0x1a3
SID=0x2a1
SID=0x2a2
SID=0x2a3
NOTE
A SID of zero is not supported at this time. Instead, the OPP libraries treat zero values as "unspecified".

ScanFrequency
Periodically, the Distributed SA will completely resynchronize its database. This also occurs if the Fabric Manager is restarted. ScanFrequency defines the minimum number of seconds between complete resynchronizations. It defaults to 600 seconds, or 10 minutes. On very large fabrics, increasing this value can help reduce the total amount of SM traffic. For example, to set the interval to 15 minutes, add this line to the bottom of the qlogic_sa.conf file:
ScanFrequency=900

LogFile
Normally, the Distributed SA logs special events through syslog to /var/log/messages. This parameter allows you to specify a different destination for the log messages. For example, to direct Distributed SA messages to their own log, add this line to the bottom of the qlogic_sa.conf file:
LogFile=/var/log/SAReplica.log
Dbg
This parameter controls how much logging the Distributed SA will do. It can be set to a number between one and seven, where one indicates no logging and seven includes informational and debugging messages. To change the Dbg setting for Distributed SA, find the line in qlogic_sa.conf that reads Dbg=5 and change it to a different value, between 1 and 7. The value of Dbg changes the amount of logging that the Distributed SA generates as follows:
Dbg=1 or Dbg=2: Alerts and Critical Errors
Only errors that will cause the Distributed SA to terminate will be reported.
Dbg=3: Errors
Errors will be reported, but nothing else. (Includes Dbg=1 and Dbg=2)
Dbg=4: Warnings
Errors and warnings will be reported. (Includes Dbg=3)
Dbg=5: Normal
Some normal events will be reported along with errors and warnings. (Includes Dbg=4)
Dbg=6: Informational Messages
In addition to the normal logging, the Distributed SA will report detailed information about its status and operation. Generally, this will produce too much information for normal use. (Includes Dbg=5)
Dbg=7: Debugging
This should only be turned on at the request of QLogic Support. This will generate so much information that system operation will be impacted. (Includes Dbg=6)
Other Settings
The remaining configuration settings for the Distributed SA are generally only useful in special circumstances and are not needed in normal operation. The sample qlogic_sa.conf configuration file contains a brief description of each.
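Putting the settings described in this section together, an edited qlogic_sa.conf might contain lines like the following sketch. The SID values are the hypothetical ones from the earlier three-range example, and the '#' comment lines assume the comment style used in the sample configuration file; adjust all values to match your fabric:
# Service IDs whose path records the Distributed SA should replicate (example values)
SID=0x0a1
SID=0x0a2
SID=0x0a3
# Resynchronize with the subnet manager at most every 15 minutes
ScanFrequency=900
# Send Distributed SA messages to their own log file
LogFile=/var/log/SAReplica.log
# Normal logging level
Dbg=5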

Changing the MTU Size

The Maximum Transfer Unit (MTU) size enabled by the IB HCA and set by the driver is 4KB. To see the current MTU size, and the maximum supported by the adapter, type the command:
$ ibv_devinfo
If the switches are set at 2K MTU size, then the HCA will automatically use this as the active MTU size; there is no need to change any file on the hosts.
To ensure that the driver on this host uses 2K MTU, add the following options line (as a root user) to the configuration file:
options ib_qib ibmtu=4
Table 3-1 shows the value of each ibmtu number designation.
Table 3-1. ibmtu Values
Number Designation Value in Bytes
1 256
2 512
3 1024
4 2048
5 4096
The following is a list of the configuration file locations for each OS:
For RHEL 5.x use file: /etc/modprobe.conf
For SLES 10 or 11 use file: /etc/modprobe.conf.local
For RHEL 6.x use file: /etc/modprobe.d/ib_qib.conf
Restart the driver as described in Managing the ib_qib Driver.
NOTE
To use 4K MTU, set the switch to have the same 4K default. If you are using QLogic switches, the following applies:
For the Externally Managed 9024, use the 4.2.2.0.3 firmware (9024DDR4KMTU_firmware.emfw) for the 9024 EM. This has the 4K MTU default, for use on fabrics where 4K MTU is required. If 4K MTU support is not required, then use the 4.2.2.0.2 DDR *.emfw file for DDR externally-managed switches. Use FastFabric (FF) to load the firmware on all the 9024s on the fabric.
For the 9000 chassis, use the most recent 9000 code 4.2.4.0.1. The 4K MTU support is in 9000 chassis version 4.2.1.0.2 and later. For the 9000 chassis, when the FastFabric 4.3 (or later) chassis setup tool is used, the user is asked to select an MTU. FastFabric can then set that MTU in all the 9000 internally managed switches. The change will take effect on the next reboot. Alternatively, for the internally managed 9000s, the ismChassisSetMtu Command Line Interface (CLI) command can be used. This should be executed on every switch and both hemispheres of the 9240s.
For the 12000 switches, refer to the QLogic FastFabric User Guide for externally managed switches, and to the QLogic FastFabric CLI Reference Guide for the internally managed switches. For reference, see the QLogic FastFabric User Guide and the QLogic 12000 CLI Reference Guide. Both are available from the QLogic web site.
For other switches, see the vendors’ documentation.

Managing the ib_qib Driver
The startup script for ib_qib is installed automatically as part of the software installation, and normally does not need to be changed. It runs as a system service.
The primary configuration file for the IB driver ib_qib, other modules, and associated daemons is /etc/infiniband/openib.conf. Normally, this configuration file is set up correctly at installation and the drivers are loaded automatically during system boot once the software has been installed. However, the ib_qib driver has several configuration variables that set reserved buffers for the software, define events to create trace records, and set the debug level. See the ib_qib man page for more details.
If you are upgrading, your existing configuration files will not be overwritten.

Configure the ib_qib Driver State

Use the following commands to check or configure the state. These methods will not reboot the system.
To check the configuration state, use this command. You do not need to be a root user:
$ chkconfig --list openibd
To enable the driver, use the following command (as a root user):
# chkconfig openibd on 2345
To disable the driver on the next system boot, use the following command (as a root user):
# chkconfig openibd off
NOTE
This command does not stop and unload the driver if the driver is already loaded, nor will it start the driver.

Start, Stop, or Restart ib_qib Driver

Restart the software if you install a new QLogic OFED+ Host Software release, change driver options, or do manual testing.
QLogic recommends using /etc/init.d/openibd to stop, start, and restart the ib_qib driver. To stop, start, or restart the ib_qib driver from the command line (as a root user), use the following syntax:
# /etc/init.d/openibd [start | stop | restart]
WARNING
If the QLogic Fabric Manager or OpenSM is configured and running on the node, it must be stopped before using the openibd stop command, and it may be started after using the openibd start command.

This method will not reboot the system. The following set of commands shows how to use this script.
When you need to determine which ib_qib driver and OpenFabrics modules are running, use the following command. You do not need to be a root user:
$ lsmod | egrep 'ipath_|ib_|rdma_|findex'
You can check to see if opensmd is configured to autostart by using the following command (as a root user); if there is no output, opensmd is not configured to autostart:
# /sbin/chkconfig --list opensmd | grep -w on

Unload the Driver/Modules Manually

You can also unload the driver/modules manually without using
/etc/init.d/openibd. Use the following series of commands (as a root user):
# umount /ipathfs
# fuser -k /dev/ipath* /dev/infiniband/*
# lsmod | egrep '^ib_|^rdma_|^iw_' | xargs modprobe -r

ib_qib Driver Filesystem

The ib_qib driver supplies a filesystem for exporting certain binary statistics to user applications. By default, this filesystem is mounted in the /ipathfs directory when the ib_qib script is invoked with the start option (for example, at system startup). The filesystem is unmounted when the ib_qib script is invoked with the stop option (for example, at system shutdown).
Here is a sample layout of a system with two cards:
/ipathfs/0/flash
/ipathfs/0/port2counters
/ipathfs/0/port1counters
/ipathfs/0/portcounter_names
/ipathfs/0/counter_names
/ipathfs/0/counters
/ipathfs/driver_stats_names
/ipathfs/driver_stats
/ipathfs/1/flash
/ipathfs/1/port2counters
/ipathfs/1/port1counters
/ipathfs/1/portcounter_names
/ipathfs/1/counter_names
/ipathfs/1/counters
The driver_stats file contains general driver statistics. There is one numbered subdirectory per IB device on the system. Each numbered subdirectory contains the following per-device files:
port1counters
port2counters
flash
The port1counters and port2counters files contain counters for the device, for example, interrupts received, bytes and packets in and out, etc. The flash file is an interface for internal diagnostic commands.
The counter_names file provides the names associated with each of the counters in the binary port#counters files, and the driver_stats_names file provides the names for the stats in the binary driver_stats files.
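For example, to list the per-device files for the first HCA (assuming the filesystem is mounted at /ipathfs as shown above):
$ ls /ipathfs/0
counter_names  counters  flash  port1counters  port2counters  portcounter_names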

More Information on Configuring and Loading Drivers

See the modprobe(8), modprobe.conf(5), and lsmod(8) man pages for more information. Also see the file /usr/share/doc/initscripts-*/sysconfig.txt for more general information on configuration files.

Performance Settings and Management Tips

The following sections provide suggestions for improving performance and simplifying cluster management. Many of these settings will be done by the system administrator.

Performance Tuning

Tuning compute or storage (client or server) nodes with IB HCAs for MPI and verbs performance can be accomplished in several ways:
Run the ipath_perf_tuning script in automatic mode (see "Performance Tuning using ipath_perf_tuning Tool" on page 3-34). This is the easiest method.
Run the ipath_perf_tuning script in interactive mode (see "Performance Tuning using ipath_perf_tuning Tool" on page 3-34, or see man ipath_perf_tuning). Interactive mode allows more control, and should be used for tuning storage (client or server) nodes.
Make changes to ib_qib driver parameter files, the BIOS, or system services using the information provided in the following sections.
NOTE
The modprobe configuration file (modprobe.conf) will be used in this section to refer to the ib_qib module configuration file, which has various paths and names in the different Linux distributions, as shown in the following list:
For RHEL 5.x use file: /etc/modprobe.conf
For SLES 10 or 11 use file: /etc/modprobe.conf.local
For RHEL 6.x use file: /etc/modprobe.d/ib_qib.conf
Systems in General (With Either Intel or AMD CPUs)
For best performance on dual-port HCAs on which only one port is active, the module parameter line in the modprobe.conf file should include the following:
options ib_qib singleport=1
Services
Turn off the specified daemons using one of the following commands according to which OS is being used:
For RHEL or similar systems use:
/sbin/chkconfig --level 12345 cpuspeed off
For SLES systems use:
/sbin/chkconfig --level 12345 powersaved off
If cpuspeed or powersaved are being used as part of implementing Turbo modes to increase CPU speed, then they can be left on. With these daemons left on, IB micro-benchmark performance results may be more variable from run-to-run.
For compute nodes, set the default runlevel to 3 to reduce overheads due to unneeded processes. Reboot the system for this change to take effect.
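On RHEL 5/6-era and SLES 10/11-era systems that use SysV init, the default runlevel is set in /etc/inittab (a sketch; systems with a different init mechanism set the default runlevel elsewhere):
id:3:initdefault: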
Default Parameter Settings
The qib driver makes certain settings by default based on a check of which CPUs are in the system. Since these are done by default, no user- or ipath_perf_tuning-generated changes need to be made in the modprobe configuration file. It doesn't hurt anything if these settings are in the file, but they are not necessary.
On all systems, the qib driver behaves as if the following parameters were set:
rcvhdrcnt=4096
If you run a script, such as the following:
for x in /sys/module/ib_qib/parameters/*; do echo $(basename $x) $(cat $x); done
Then in the list of qib parameters, you should see the following parameter being discussed:
. . .
rcvhdrcnt 0
The 0 means the driver automatically sets these parameters. Therefore, neither the user nor the ipath_perf_tuning script should modify these parameters.
Compute-only Node (Not part of a parallel file system cluster)
No tuning is required, other than what is in the Systems in General (With Either Intel or AMD CPUs) section.
For more details on settings that are specific to either Intel or AMD CPUs, refer to the following sections for details on systems with those types of CPUs.
Storage Node (for example, Lustre/GPFS client or server node)
Although termed a “Storage Node” this information includes nodes that are primarily compute nodes, but also act as clients of a parallel file server.
Increasing the number of kernel receive queues allows more CPU cores to be involved in the processing of verbs traffic. This is important when using parallel file systems such as Lustre or IBM's GPFS (General Parallel File System). The module parameter that sets this number is krcvqs. Each additional kernel receive queue (beyond the one default queue for each port) takes user contexts away from PSM and from the support of MPI or compute traffic. The formula which illustrates this trade-off is:
PSM Contexts = 16 - (krcvqs - 1) x num_ports
where num_ports is the number of ports on the HCA.
For example, on a single-port card with krcvqs=4 set in modprobe.conf:
PSM Contexts = 16 - (4-1)x 1 = 16 - 3 = 13
If this were a 12-core node, then 13 is more than enough PSM contexts to run an MPI process on each core without making use of context-sharing. An example ib_qib options line in the modprobe.conf file for this 12-core node case is:
options ib_qib singleport=1 krcvqs=4
Table 3-2 can be used as a guide for setting the krcvqs parameter for the number of cores in the system supporting PSM processes and the number of ports in the HCA. The table applies most readily to nodes with one HCA being used to support PSM (for example, MPI or SHMEM) processes. For nodes with multiple HCAs that are being used for PSM, use the table to decide the maximum number of cores that will be assigned on each HCA to support PSM (MPI or SHMEM) processes, then apply the table to each HCA in turn.
Table 3-2. krcvqs Parameter Settings
Cores per Node (to be used for MPI/PSM on 1 HCA) | 1-port, Set krcvqs= | 2 active ports in the HCA, Set krcvqs=
61-64 | 1 (a) | 1 (a)
57-60 | 2 | 1 (a)
53-56 | 3 | 2,1 (2 for port 1, 1 for one port)
12-52 | 4 | 2
8-11 | 3 | 2,1 (2 for port 1, 1 for one port)
4-7 | 2 | 1 (a)
1-3 | 1 (a) | 1 (a)
(a) 1 is the default setting, so if the table recommends '1', krcvqs does not need to be set.
In the rare case that the node has more than 64 cores, and it is desired to run MPI on more than 64 cores, then two HCAs are required and settings can be made, using the rules in Table 3-2, as though half the cores were assigned to each HCA.
AMD CPU Systems
To improve IPoIB and other verbs-based throughput performance, on AMD CPU systems, QLogic recommends setting pcie_caps=0x51 numa_aware=1 as modprobe configuration file parameters. For example, the module parameter line in the modprobe configuration file should include the following for AMD Opteron CPUs:
options ib_qib pcie_caps=0x51 numa_aware=1
On AMD systems, the pcie_caps=0x51 setting will result in a line of the lspci -vv output associated with the QLogic HCA reading in the "DevCtl" section:
MaxPayload 128 bytes, MaxReadReq 4096 bytes.
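One way to confirm the resulting values is to filter the lspci -vv output for the HCA. This is a sketch only; the -d 1077: filter assumes the HCA reports QLogic's PCI vendor ID (1077), and the exact output format varies with the lspci version:
# lspci -vv -d 1077: | grep -i maxpayload
MaxPayload 128 bytes, MaxReadReq 4096 bytes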
AMD Interlagos CPU Systems
With AMD Interlagos (Opteron 6200 Series) CPU systems, better performance will be obtained if, on single-HCA systems, the HCA is put in a PCIe slot closest to Socket number 1. You can typically find out which slots these are by looking at the schematics in the manual for your motherboard. (There is currently a BIOS or kernel problem which implies that no NUMA topology information is available from the kernel.)
To obtain top “Turbo boosts” of up to 1GHz in clock rate when running on half the cores of a node, AMD recommends enabling the C6 C-state in the BIOS. Some applications (but certainly not all) run better when running on half the cores of an Interlagos node (on every other core, one per Bulldozer module). QLogic recommends enabling this C-state in the BIOS.
Intel CPU Systems
Typical tuning for recent Intel CPUs
For recent Intel CPUs (code-named Sandy Bridge, Westmere or Nehalem), set the following BIOS parameters:
Disable all C-States.
Disable Intel Hyper-Threading technology
For setting all C-States to 0 where there is no BIOS support:
1. Add the following kernel boot parameter:
processor.max_cstate=0
2. Reboot the system.
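As an illustration only, appending the parameter to the kernel line of a GRUB-legacy boot loader configuration might look like the following sketch (the file path, kernel version, and root device shown here are hypothetical and vary by distribution):
# /boot/grub/grub.conf (excerpt)
kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/sda1 processor.max_cstate=0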
If the node uses a single-port HCA, and is not a part of a parallel file system cluster, there is no need for performance tuning changes to a modprobe configuration file. The driver will automatically set the parameters appropriately for the node's Intel CPU, in a conservative manner.
For all Intel systems with Xeon 5500 Series (Nehalem) or newer CPUs, the following settings are default:
pcie_caps=0x51
On Intel systems with Xeon 5500 Series (Nehalem) or newer CPUs, the lspci output will read:
MaxPayload 256 bytes, MaxReadReq 4096 bytes
If you run a script, such as the following:
for x in /sys/module/ib_qib/parameters/*; do echo $(basename $x) $(cat $x); done
Then in the list of qib parameters, you should see the following for the two parameters being discussed:
. . .
rcvhdrcnt 0
. . .
pcie_caps 0
The 0 means the driver automatically sets these parameters. Therefore, neither the user nor the ipath_perf_tuning script should modify these parameters.
Intel Nehalem or Westmere CPU Systems (DIMM Configuration)
Compute node memory bandwidth is important for high-performance computing (HPC) application performance and for storage node performance. On Intel CPUs code named Nehalem or Westmere (Xeon 5500 series or 5600 series) it is important to have an equal number of dual in-line memory modules (DIMMs) on each of the three memory channels for each CPU. On the common dual CPU systems, you should use a multiple of six DIMMs for best performance.
High Risk Tuning for Intel Harpertown CPUs
For tuning the Harpertown generation of Intel Xeon CPUs that entails a higher risk factor, but includes a bandwidth benefit, the following can be applied:
For nodes with Intel Harpertown, Xeon 54xx CPUs, you can add pcie_caps=0x51 and pcie_coalesce=1 to the modprobe.conf file. For example:
options ib_qib pcie_caps=0x51 pcie_coalesce=1
If the following problem is reported by syslog, a typical diagnostic can be performed, which is described in the following paragraphs:
[PCIe Poisoned TLP][Send DMA memory read]
Another potential issue is that after starting openibd, messages such as the following appear on the console:
Message from syslogd@st2019 at Nov 14 16:55:02 ...
kernel:Uhhuh. NMI received for unknown reason 3d on CPU 0
After this happens, you may also see the following message in the syslog:
Mth dd hh:mm:ss st2019 kernel: ib_qib 0000:0a:00.0: infinipath0:
Fatal Hardware Error, no longer usable, SN AIB1013A43727
These problems typically occur on the first run of an MPI program running over the PSM transport or immediately after the link becomes active. The adapter will be unusable after this situation until the system is rebooted. To resolve this issue try the following solutions in order:
Remove pcie_coalesce=1
Restart openibd and try the MPI program again
Remove both pcie_caps=0x51 and pcie_coalesce=1 options from the ib_qib line in the modprobe.conf file and reboot the system
NOTE
Removing both options will technically avoid the problem, but can result in an unnecessary performance decrease. If the system has already failed with the above diagnostic, it will need to be rebooted. Note that in the modprobe.conf file, all options for a particular kernel module must be on the same line, not on repeated options ib_qib lines.
Additional Driver Module Parameter Tunings Available
Setting driver module parameters on Per-unit or Per-port basis
The ib_qib driver allows the setting of different driver parameter values for the individual HCAs and ports. This allows the user to specify different values for each port on a HCA or different values for each HCA in the system. This feature is used when there is a need to tune one HCA or port for a particular type of traffic, and a different HCA or port for another type of traffic, for example, compute versus storage traffic.
Not all driver parameters support per-unit or per-port values. The driver parameters which can be used with the new syntax are listed below:
Per-unit parameters:
singleport – Use only IB port 1; more per-port buffer space
cfgctxts – Set max number of contexts to use
pcie_caps – Max PCIe tuning: MaxPayload, MaxReadReq
Per-port parameters:
ibmtu – Set max IB MTU
krcvqs – number of kernel receive queues
num_vls – Set number of Virtual Lanes to use
Specifying individual unit/port values is done by using a specific module parameter syntax:
param name=[default,][unit[:port]=value]
Where:
param name is the driver module parameter name (listed above)
default is the default value for that parameter. This value will be
used for all remaining units/port which have not had individual values set. If no individual unit/port values have been specified, the default value will be used for all units/ports
unit is the index of the HCA unit (as seen by the driver). This value is
0-based (index of first unit is '0').
port is the port number on that HCA. This value is 1-based (number
of first port is '1').
value is the parameter value for the particular unit or port.
The fields in the square brackets are optional; however, either a default or a per-unit/per-port value is required.
Example usage:
To set the default IB MTU to 1K for all ports on all units:
ibmtu=3
To set the IB MTU to 256-bytes for unit 0/port 1 and 4096-bytes for unit 0/port 2:
ibmtu=0:1=1,0:2=5
To set the default IB MTU to 2K for all ports but specify 4K for unit 0/port 1:
ibmtu=4,0:1=5
To set singleport to OFF as the default and turn it ON for unit 1:
singleport=0,1=1
To set number of configured contexts to 10 on unit 0 and 16 on unit 1:
cfgctxts=0=10,1=16
A user can identify HCAs and correlate them to system unit numbers by using the -b option (beacon mode option) of the ipath_control script. Issuing the following command (as root), where unit is the system unit number, will cause that HCA to start blinking the LEDs on the face of the board in an alternating pattern:
ipath_control -u unit -b on
Once the board has been identified, the user can return the LEDs to normal mode of operation with the following command (as root):
ipath_control -u unit -b off
numa_aware
The Non-Uniform Memory Access (NUMA) awareness (numa_aware) module parameter enables driver memory allocations in the same memory domain or NUMA node of the HCA. This improves the overall system efficiency with CPUs on the same NUMA node having faster access times and higher bandwidths to memory.
The default is:
options ib_qib numa_aware=10
This setting lets the driver automatically decide on the allocation behavior; it disables this feature on platforms with AMD CPUs and Intel Westmere-or-earlier CPUs, while enabling it on newer Intel CPUs.
Tunable options:
options ib_qib numa_aware=0
This setting disables NUMA awareness when allocating memory within the driver. The memory allocation requests will be satisfied on the NUMA node of the CPU that executes the request.
options ib_qib numa_aware=1
This setting enables the feature, with the driver allocating memory on the NUMA node closest to the HCA.
recv_queue_size, Tuning Related to NAKs
Receiver Not Ready Negative Acknowledgements (RNR NAKs) can slow IPoIB down significantly. IB is fast enough to overrun IPoIB's receive queue before the post receives can occur. The counter to look for on the sending side is RC RNR NAKs in the per-port stats file, as shown in the following example:
# cat /sys/class/infiniband/qib0/stats
Port 1:
RC timeouts 0
RC resends 0
RC QACKs 0
RC SEQ NAKs 0
RC RDMA seq 0
RC RNR NAKs 151 <---------
RC OTH NAKs 0
. . .
Ctx:npkts 0:170642806
Check the RC RNR NAKs before and after running the IPoIB test to see if that counter is increasing. If so, then increasing IPoIB's recv_queue_size to 512 in the ib_ipoib.conf file should eliminate RNR NAKs.
For example:
# cat /etc/modprobe.d/ib_ipoib.conf
alias ib0 ib_ipoib
alias ib1 ib_ipoib
options ib_ipoib recv_queue_size=512
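After the IPoIB driver is reloaded, you can confirm that the new value took effect by reading the module parameter from sysfs (a sketch; the path assumes the standard module parameter layout):
$ cat /sys/module/ib_ipoib/parameters/recv_queue_size
512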

Performance Tuning using ipath_perf_tuning Tool

The ipath_perf_tuning tool is intended to adjust parameters of the ib_qib driver to optimize IB and application performance. The tool is designed to be run once per installation; however, it can be re-run if changes to the configuration need to be made. Changes are made to the appropriate modprobe file depending on the Linux distribution (see Affected Files).
The tool takes into account the type of the node being configured and can be run in one of two modes - automatic (the default) and interactive. In automatic mode, the tool will make the parameter adjustments without the need for any user input. Interactive mode will prompt the user for input on some of the settings and actions.
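For example, the following invocations are a sketch of how the tool is typically run (as a root user, after the software is installed; the exact prompts and resulting settings depend on the node's hardware):
# Automatic mode (the default), suitable for compute nodes:
ipath_perf_tuning
# Interactive mode, which prompts for the node type and optional settings (see the -I option in Table 3-4):
ipath_perf_tuning -I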
Table 3-3 lists the checks the tool performs on the system on which it is run.
Table 3-3. Checks Performed by ipath_perf_tuning Tool
Check Type Description
pcie_caps Adjust PCIe tuning for max payload and read request size. The result of this test depends on the CPU type of the node.
singleport Determine whether to run the HCA in single port mode, increasing the internal HCA resources for that port. This setting depends on the user's input and is only performed in interactive mode.
krcvqs Determine the number of kernel receive contexts to allocate. Normally, the driver allocates one context per physical port. However, more kernel receive contexts can be allocated to improve Verbs performance.
pcie_coalesce Enable PCIe coalescing. PCIe coalescing is only needed or enabled on some systems with Intel Harpertown CPUs.
cache_bypass_copy Enable the use of cache bypass copies. This option is enabled on AMD Interlagos (62xx) series processors.
numa_aware Enable NUMA-aware memory allocations. This option is enabled on AMD CPUs only.
cstates Check whether (and which) C-States are enabled. C-States should be turned off for best performance.
services Check whether certain system services (daemons) are enabled. These services should be turned off for best performance.
The values picked for the various checks and tests may depend on the type of node being configured. The tool is aware of two types of nodes: compute nodes and storage nodes.
Compute Nodes
Compute nodes are nodes which should be optimized for faster computation and communication with other compute nodes.
Storage (Client or Server) Nodes
Storage nodes are nodes which serve as clients or servers in a parallel filesystem network. Storage nodes (especially clients) are typically performing computation and using MPI, in addition to sending and receiving storage network traffic. The objective is to improve IB verbs communications while maintaining good MPI performance.
OPTIONS
Table 3-4 lists the options for the ipath_perf_tuning tool and describes each option.
Table 3-4. ipath_perf_tuning Tool Options
Option Description
-h Display a short multi-line help message.
-T test Limit the list of tests/checks which the tool performs to only those specified by the option. Multiple tests can be specified as a comma-separated list.
-I Run the tool in interactive mode. In this mode, the tool will prompt the user for input on certain tests.
AUTOMATIC vs. INTERACTIVE MODE
The tool performs different functions when running in automatic mode compared to running in the interactive mode. The differences include the node type selection, test execution, and applying the results of the executed tests.
Node Type Selection
The tool is capable of configuring compute nodes or storage nodes (see Compute
Nodes and Storage (Client or Server) Nodes). When the tool is executed in
interactive mode, it will query the user for the type of node. When the tool is running in automatic mode, it assumes that the node being configured is a compute node.
Test Execution
The main difference between the two modes is that some of the tests are effectively skipped when the tool is in automatic mode. This is done because these tests do not provide a guaranteed universal performance gain; therefore, changing the driver parameters associated with them requires user approval. Other tests, where the tool can make a safe determination, are performed in both modes without any user interaction. Table 3-5 lists the tests and describes the mode(s) for each.
Table 3-5. Test Execution Modes
Test Mode
pcie_caps Test is performed in both modes without any user interaction.
singleport Test is only performed in interactive mode. The user is queried whether to enable singleport mode.
krcvqs Test is performed in both modes without any user interaction.
pcie_coalesce Test is performed only in interactive mode. The user is queried whether to enable PCIe coalescing.
cache_bypass_copy Test is performed in both modes without any user interaction.
numa_aware Test is performed in both modes without any user interaction.
cstates Test is performed in both modes, but the user is only notified of a potential issue if the tool is in interactive mode. In that case, the tool displays a warning and a suggestion on how to fix the issue.
services Test is performed in both modes, but the user is notified of running services only if the tool is in interactive mode. In that case, the user is queried whether to turn the services off.
Applying the Results
Automatic mode versus interactive mode also has an effect when the tool is committing the changes to the system. Along with the necessary driver parameters, the script also writes a comment line in the appropriate file which serves as a marker. This marker contains the version of the script which is making the changes. If the version recorded matches the version of the script currently being run, the changes are only committed if the tool is in interactive mode. The assumption is that the script is being re-run by the user to make adjustments.
Affected Files
The following lists the distribution and the file that is modified by the ipath_perf_tuning tool:
RHEL 6.0 and later /etc/modprobe.d/ib_qib.conf
RHEL prior to 6.0 /etc/modprobe.conf
SLES – /etc/modprobe.conf.local

Homogeneous Nodes

To minimize management problems, the compute nodes of the cluster should have very similar hardware configurations and identical software installations. A mismatch between the software versions can also cause problems. Old and new libraries must not be run within the same job. It may also be useful to distinguish between the IB-specific drivers and those that are associated with kernel.org, OpenFabrics, or are distribution-built. The most useful tools are:
ident (see “ident” on page G-24)
ipathbug-helper (see “ipath_checkout” on page G-25)
ipath_checkout (see “ipath_checkout” on page G-25)
ipath_control (see “ipath_control” on page G-27)
mpirun (see “mpirun” on page G-31)
rpm (see “rpm” on page G-32)
strings (see “strings” on page G-32)
Run these tools to gather information before reporting problems and requesting support.

Adapter and Other Settings

The following adapter and other settings can be adjusted for better performance.
NOTE
For the most current information on performance tuning, refer to the QLogic OFED+ Host Software Release Notes.
Use an IB MTU of 4096 bytes instead of 2048 bytes, if available, with
the QLE7340 and QLE7342. 4K MTU is enabled in the ib_qib driver by
default. To change this setting for the driver, see “Changing the MTU Size”
on page 3-20.
Make sure that write combining is enabled. The x86 Page Attribute Table
(PAT) mechanism that allocates Write Combining (WC) mappings for the PIO buffers has been added and is now the default. If PAT is unavailable or PAT initialization fails for some reason, the code will generate a message in the log and fall back to the MTRR mechanism. See Appendix F Write
Combining for more information.
Check the PCIe bus width. If slots have a smaller electrical width than
mechanical width, lower than expected performance may occur. Use this command to check PCIe Bus width:
$ ipath_control -iv
This command also shows the link speed.
Experiment with non-default CPU affinity while running
single-process-per-node latency or bandwidth benchmarks. Latency
may be slightly lower when using different CPUs (cores) from the default. On some chipsets, bandwidth may be higher when run from a non-default CPU or core. For the MPI being used, look at its documentation to see how to force a benchmark to run with a different CPU affinity than the default. With OFED micro benchmarks such as from the qperf or perftest suites, taskset will work for setting CPU affinity.

Remove Unneeded Services

The cluster administrator can enhance application performance by minimizing the set of system services running on the compute nodes. Since these are presumed to be specialized computing appliances, they do not need many of the service daemons normally running on a general Linux computer.
Following are several groups constituting a minimal necessary set of services. These are all services controlled by chkconfig. To see the list of services that are enabled, use the command:
$ /sbin/chkconfig --list | grep -w on
Basic network services are:
network
ntpd
syslog
xinetd
sshd
For system housekeeping, use:
anacron atd
crond
If you are using Network File System (NFS) or yellow pages (yp) passwords:
rpcidmapd ypbind portmap
nfs
nfslock
autofs
To watch for disk problems, use:
smartd
readahead
The service comprising the ib_qib driver and SMA is:
openibd
Other services may be required by your batch queuing system or user community.

NOTE
If your system is running the daemon irqbalance, QLogic recommends turning it off. Disabling irqbalance will enable more consistent performance with programs that use interrupts. Use this command:
# /sbin/chkconfig irqbalance off
See “Erratic Performance” on page D-10 for more information.

Host Environment Setup for MPI

After the QLogic OFED+ Host software and the GNU (GCC) compilers have been installed on all the nodes, the host environment can be set up for running MPI programs.

Configuring for ssh

Running MPI programs with the command mpirun on an IB cluster depends, by default, on secure shell ssh to launch node programs on the nodes.
To use ssh, you must have generated Rivest, Shamir, Adleman (RSA) or Digital Signature Algorithm (DSA) keys, public and private. The public keys must be distributed and stored on all the compute nodes so that connections to the remote machines can be established without supplying a password.
You or your administrator must set up the ssh keys and associated files on the cluster. There are two methods for setting up ssh on your cluster. The first method, the shosts.equiv mechanism, is typically set up by the cluster administrator. The second method, using ssh-agent, is more easily accomplished by an individual user.
NOTE: rsh can be used instead of ssh. To use rsh, set the environment variable MPI_SHELL=rsh. See “Environment Variables” on page 4-18 for information on setting environment variables. Also see “Shell Options” on page A-6 for information on setting shell options in mpirun. rsh has a limit on the number of concurrent connections it can have, typically 255, which may limit its use on larger clusters.
Configuring ssh and sshd Using shosts.equiv
This section describes how the cluster administrator can set up ssh and sshd through the shosts.equiv mechanism. This method is recommended, provided that your cluster is behind a firewall and accessible only to trusted users.
“Configuring for ssh Using ssh-agent” on page 3-43 shows how an individual user can accomplish the same thing using ssh-agent.
The example in this section assumes the following:
Both the cluster nodes and the front end system are running the openssh package as distributed in current Linux systems.
All cluster end users have accounts with the same account name on the front end and on each node, by using Network Information Service (NIS) or another means of distributing the password file.
The front end used in this example is called ip-fe.
Root or superuser access is required on ip-fe and on each node to configure ssh.
ssh, including the host’s key, has already been configured on the system ip-fe. See the sshd and ssh-keygen man pages for more information.
To use shosts.equiv to configure ssh and sshd:
1. On the system ip-fe (the front end node), change the /etc/ssh/ssh_config file to allow host-based authentication. Specifically, this file must contain the following four lines, all set to yes. If the lines are already there but commented out (with an initial #), remove the #.
RhostsAuthentication yes
RhostsRSAAuthentication yes
HostbasedAuthentication yes
EnableSSHKeysign yes
2. On each of the IB node systems, create or edit the file
/etc/ssh/shosts.equiv, adding the name of the front end system. Add the
line:
ip-fe
Change the file to mode 600 when you are finished editing.
3. On each of the IB node systems, create or edit the file /etc/ssh/ssh_known_hosts. You will need to copy the contents of the file /etc/ssh/ssh_host_dsa_key.pub from ip-fe to this file (as a single line), and then edit that line to insert ip-fe ssh-dss at the beginning of the line. This is very similar to the standard known_hosts file for ssh. An example line might look like this (displayed as multiple lines, but a single line in the file):
ip-fe ssh-dss AAzAB3NzaC1kc3MAAACBAPoyES6+Akk+z3RfCkEHCkmYuYzqL2+1nwo4LeTVW pCD1QsvrYRmpsfwpzYLXiSJdZSA8hfePWmMfrkvAAk4ueN8L3ZT4QfCTwqvHV vSctpibf8n aUmzloovBndOX9TIHyP/Ljfzzep4wL17+5hr1AHXldzrmgeEKp6ect1wxAAAA FQDR56dAKFA4WgAiRmUJailtLFp8swAAAIBB1yrhF5P0jO+vpSnZrvrHa0Ok+ Y9apeJp3sessee30NlqKbJqWj5DOoRejr2VfTxZROf8LKuOY8tD6I59I0vlcQ 812E5iw1GCZfNefBmWbegWVKFwGlNbqBnZK7kDRLSOKQtuhYbGPcrVlSjuVps fWEju64FTqKEetA8l8QEgAAAIBNtPDDwdmXRvDyc0gvAm6lPOIsRLmgmdgKXT GOZUZ0zwxSL7GP1nEyFk9wAxCrXv3xPKxQaezQKs+KL95FouJvJ4qrSxxHdd1 NYNR0DavEBVQgCaspgWvWQ8cL 0aUQmTbggLrtD9zETVU5PCgRlQL6I3Y5sCCHuO7/UvTH9nneCg==
Change the file to mode 600 when you are finished editing.
4. On each node, the system file
/etc/ssh/sshd_config must be edited, so
that the following four lines are uncommented (no # at the start of the line) and set to yes. (These lines are usually there, but are commented out and set to no by default.)
RhostsAuthentication yes
RhostsRSAAuthentication yes
HostbasedAuthentication yes
PAMAuthenticationViaKbdInt yes
5. After creating or editing the three files in Steps 2, 3, and 4, sshd must be restarted on each system. If you are already logged in via ssh (or any other user is logged in via ssh), their sessions or programs will be terminated, so restart only on idle nodes. Type the following (as root) to notify sshd to use the new configuration files:
# killall -HUP sshd
NOTE: This command terminates all ssh sessions into that system. Run it from the console, or have a way to log into the console in case of any problem.
At this point, any end user should be able to login to the ip-fe front end system and use
ssh to login to any IB node without being prompted for a password or
pass phrase.
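As a quick check (node001 is a placeholder hostname), you can force host-based authentication from ip-fe and verify that no password prompt appears:
$ ssh -o PreferredAuthentications=hostbased node001 hostname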
Configuring for ssh Using ssh-agent
The ssh-agent, a daemon that caches decrypted private keys, can be used to store the keys. Use ssh-add to add your private keys to ssh-agent’s cache. When ssh establishes a new connection, it communicates with ssh-agent to acquire these keys, rather than prompting you for a passphrase.
The process is described in the following steps:
1. Create a key pair. Use the default file name, and be sure to enter a passphrase.
$ ssh-keygen -t rsa
2. Enter a passphrase for your key pair when prompted. Note that the key agent does not survive X11 logout or system reboot:
$ ssh-add
3. The following command tells ssh that your key pair should let you in:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Edit the ~/.ssh/config file so that it reads like this:
Host *
ForwardAgent yes
ForwardX11 yes
CheckHostIP no
StrictHostKeyChecking no
This file forwards the key agent requests back to your desktop. When you log into a front end node, you can use ssh to compute nodes without passwords.
4. Follow your administrator’s cluster policy for setting up ssh-agent on the machine where you will be running ssh commands. Alternatively, you can start the ssh-agent by adding the following line to your ~/.bash_profile (or equivalent in another shell):
eval `ssh-agent`
Use back quotes rather than single quotes. Programs started in your login shell can then locate the ssh-agent and query it for keys.
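To confirm that the agent is reachable from your shell and holds your key, you can list the cached identities:
$ ssh-add -l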
5. Finally, test by logging into the front end node, and from the front end node to a compute node, as follows:
$ ssh frontend_node_name
$ ssh compute_node_name
For more information, see the man pages for ssh(1), ssh-keygen(1),
ssh-add(1), and ssh-agent(1).

Process Limitation with ssh

Process limitation with ssh is primarily an issue when using the mpirun option -distributed=off. The default setting is now -distributed=on; therefore, in most cases, ssh process limitations will not be encountered. This limitation for the -distributed=off case is described in the following paragraph.
MPI jobs that use more than 10 processes per node may encounter an ssh throttling mechanism that limits the amount of concurrent per-node connections to 10. If you need to use more processes, you or your system administrator must increase the value of MaxStartups in your /etc/ssh/sshd_config file.
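For example (the value 64 is only illustrative; choose a limit appropriate for your job sizes), add or change the following line in /etc/ssh/sshd_config on each node and then restart sshd:
MaxStartups 64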

Checking Cluster and Software Status

ipath_control

IB status, link speed, and PCIe bus width can be checked by running the program
ipath_control. Sample usage and output are as follows:
$ ipath_control -iv
QLogic OFED.VERSION yyyy_mm_dd.hh_mm_ss
0: Version: ChipABI VERSION, InfiniPath_QLE7340, InfiniPath1 VERSION, SW Compat 2
0: Serial: RIB0935M31511 LocalBus: PCIe,5000MHz,x8
0,1: Status: 0xe1 Initted Present IB_link_up IB_configured
0,1: LID=0x23 GUID=0011:7500:005a:6ad0
0,1: HRTBT:Auto LINK:40 Gb/sec (4X QDR)

iba_opp_query

iba_opp_query is used to check the operation of the Distributed SA. You can run it from any node where the Distributed SA is installed and running, to verify that the replica on that node is working correctly. See “iba_opp_query” on
page G-4 for detailed usage information.
# iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107
Query Parameters:
resv1 0x0000000000000107
dgid ::
sgid ::
dlid 0x75
slid 0x31
hop 0x0
flow 0x0
tclass 0x0
num_path 0x0
pkey 0x0
qos_class 0x0
sl 0x0
mtu 0x0
rate 0x0
pkt_life 0x0
preference 0x0
resv2 0x0
resv3 0x0
Using HCA qib0
Result:
resv1 0x0000000000000107
dgid fe80::11:7500:79:e54a
sgid fe80::11:7500:79:e416
dlid 0x75
slid 0x31
hop 0x0
flow 0x0
tclass 0x0
num_path 0x0
pkey 0xffff
qos_class 0x0
sl 0x1
mtu 0x4
rate 0x6
pkt_life 0x10
preference 0x0
resv2 0x0
resv3 0x0

ibstatus

Another useful program is ibstatus that reports on the status of the local HCAs. Sample usage and output are as follows:
$ ibstatus
Infiniband device 'qib0' port 1 status:
default gid: fe80:0000:0000:0000:0011:7500:005a:6ad0
base lid: 0x23
sm lid: 0x108
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: IB

ibv_devinfo

ibv_devinfo queries RDMA devices. Use the -v option to see more information.
Sample usage:
$ ibv_devinfo
hca_id: qib0
fw_ver: 0.0.0
node_guid: 0011:7500:00ff:89a6
sys_image_guid: 0011:7500:00ff:89a6
vendor_id: 0x1175
vendor_part_id: 29216
hw_ver: 0x2
board_id: InfiniPath_QLE7280
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 31
port_lmc: 0x00

ipath_checkout

ipath_checkout is a bash script that verifies that the installation is correct and
that all the nodes of the network are functioning and mutually connected by the IB fabric. It must be run on a front end node, and requires specification of a nodefile. For example:
$ ipath_checkout [options] nodefile
The nodefile lists the hostnames of the nodes of the cluster, one hostname per line. The format of the nodefile is as follows:
hostname1
hostname2
...
For more information on these programs, see “ipath_control” on page G-27, “ibstatus” on page G-22, and “ipath_checkout” on page G-25.
4 Running MPI on QLogic Adapters
This section provides information on using the Message-Passing Interface (MPI) on QLogic IB HCAs. Examples are provided for setting up the user environment, and for compiling and running MPI programs.

Introduction

The MPI standard is a message-passing library or collection of routines used in distributed-memory parallel programming. It is used in data exchange and task synchronization between processes. The goal of MPI is to provide portability and efficient implementation across different platforms and architectures.

MPIs Packaged with QLogic OFED+

The high-performance open-source MPIs packaged with QLogic OFED+ include: Open MPI version 1.4.3, Ohio State University MVAPICH version 1.2, and MVAPICH2 version 1.7. These MPIs are offered in versions built with the high-performance Performance Scaled Messaging (PSM) interface and versions built to run over IB Verbs. There are also commercial MPIs that are not packaged with QLogic OFED+, Intel MPI and Platform MPI, which both make use of the PSM application programming interface (API) and can both run over IB Verbs or over the user direct access programming library (uDAPL), which uses IB Verbs. For more information on other MPIs, see Section 5 Using Other MPIs.

Open MPI

Open MPI is an open source MPI-2 implementation from the Open MPI Project. Pre-compiled versions of Open MPI version 1.4.3 that run over PSM and are built with the GCC, PGI, and Intel compilers are available with the QLogic download. Open MPI that runs over Verbs is also available.
Open MPI can be managed with the mpi-selector utility, as described in
“Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility” on page 5-5.

Installation

Follow the instructions in the QLogic Fabric Software Installation Guide for installing Open MPI.
Newer versions of Open MPI released after this QLogic OFED+ release will not be supported (refer to the OFED+ Host Software Release Notes for version numbers). QLogic does not recommend installing any newer versions of Open MPI. If a newer version is required, it can be found on the Open MPI web site (http://www.open-mpi.org/) and installed after QLogic OFED+ has been installed.

Setup

When using the mpi-selector tool, the necessary $PATH and $LD_LIBRARY_PATH setup is done.
When not using the mpi-selector tool, put the Open MPI installation directory in the PATH by adding the following to PATH:
$mpi_home/bin
where $mpi_home is the directory path where Open MPI is installed.
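For example (a sketch assuming the default GCC/PSM installation path used elsewhere in this guide; adjust the directory for your compiler, version, and library layout), for sh or bash:
$ export PATH=/usr/mpi/gcc/openmpi-1.4.3-qlc/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.4.3-qlc/lib64:$LD_LIBRARY_PATH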

Compiling Open MPI Applications

QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 4-1).
Table 4-1. Open MPI Wrapper Scripts
Wrapper Script Name Language
mpicc C
mpiCC, mpicxx, or mpic++ C++
mpif77 Fortran 77
mpif90 Fortran 90
To compile your program in C, type the following:
$ mpicc mpi_app_name.c -o mpi_app_name
These scripts all provide the command line options listed in Table 4-2.
Table 4-2. Command Line Options for Scripts
Command Meaning
man mpicc (mpif90, mpicxx, etc.) Provides help
-showme Lists each of the compiling and linking commands that would be called without actually invoking the underlying compiler
-showme:compile Shows the compile-time flags that would be supplied to the compiler
-showme:link Shows the linker flags that would be supplied to the compiler for the link phase
These wrapper scripts pass most options on to the underlying compiler. Use the documentation for the underlying compiler (gcc, icc, pgcc, etc. ) to determine what options to use for your application.
QLogic strongly encourages using the wrapper compilers instead of attempting to link to the Open MPI libraries manually. This allows the specific implementation of Open MPI to change without forcing changes to linker directives in users' Makefiles.

Create the mpihosts File

Create an MPI hosts file in the same working directory where Open MPI is installed. The MPI hosts file contains the host names of the nodes in your cluster that run the examples, with one host name per line. Name this file mpihosts. The contents can be in the following format:
hostname1
hostname2
...
More details on the mpihosts file can be found in “mpihosts File Details” on
page 4-12.

Running Open MPI Applications

The Open MPI choices available from mpi-selector --list are:
openmpi_gcc-1.4.3
openmpi_gcc_qlc-1.4.3
openmpi_intel_qlc-1.4.3
openmpi_pgi_qlc-1.4.3.
The first choice will use verbs by default, and any with the _qlc string will use PSM by default. If you chose openmpi_gcc_qlc-1.4.3, for example, then the following simple mpirun command would run using PSM:
$ mpirun -np 4 -machinefile mpihosts mpi_app_name
To run over IB Verbs instead of the default PSM transport in
openmpi_gcc_qlc-1.4.3, use this mpirun command line:
$ mpirun -np 4 -machinefile mpihosts --mca btl sm --mca btl
openib,self --mca mtl ^psm mpi_app_name
The following command enables shared memory:
--mca btl sm
The following command enables openib transport and communication to self:
--mca btl openib,self
The following command disables PSM transport:
--mca mtl ^psm
In these commands, btl stands for byte transport layer and mtl for matching transport layer.
PSM transport works in terms of MPI messages. OpenIB transport works in terms of byte streams.
Alternatively, you can use Open MPI with a sockets transport running over IPoIB, for example:
$ mpirun -np 4 -machinefile mpihosts --mca btl sm --mca btl tcp,self --mca btl_tcp_if_exclude eth0 --mca btl_tcp_if_include ib0 --mca mtl ^psm mpi_app_name
Note that eth0 and psm are excluded, while ib0 is included. These instructions may need to be adjusted for your interface names.
Note that in Open MPI, machinefile is also known as the hostfile.

Further Information on Open MPI

For more information about Open MPI, see:
http://www.open-mpi.org/
http://www.open-mpi.org/faq

Configuring MPI Programs for Open MPI

When configuring an MPI program (generating header files and/or Makefiles) for Open MPI, you usually need to specify mpicc, mpicxx, and so on as the compiler, rather than gcc, g++, etc.
Specifying the compiler is typically done with commands similar to the following, assuming that you are using sh or bash as the shell:
$ export CC=mpicc
$ export CXX=mpicxx
$ export F77=mpif77
$ export F90=mpif90
The shell variables will vary with the program being configured. The following examples show frequently used variable names. If you use csh, use commands similar to the following:
$ setenv CC mpicc
You may need to pass arguments to configure directly, for example:
$ ./configure -cc=mpicc -fc=mpif77 -c++=mpicxx
-c++linker=mpicxx
You may also need to edit a Makefile to achieve this result, adding lines similar to:
CC=mpicc
F77=mpif77
F90=mpif90
CXX=mpicxx
In some cases, the configuration process may specify the linker. QLogic recommends that the linker be specified as mpicc, mpif90, etc. in these cases. This specification automatically includes the correct flags and libraries, rather than trying to configure to pass the flags and libraries explicitly. For example:
LD=mpif90
These scripts pass appropriate options to the various compiler passes to include header files, required libraries, etc. While the same effect can be achieved by passing the arguments explicitly as flags, the required arguments may vary from release to release, so it is good practice to use the provided scripts.

To Use Another Compiler

Open MPI, and all other MPIs that run on InfiniBand®, support a number of compilers in addition to the default GNU Compiler Collection (GCC, including gcc, g++, and gfortran) versions 3.3 and later. These include PGI 8.0 through 11.9, and Intel 9.x, 10.1, 11.x, and 12.x.
The easiest way to use other compilers with any MPI that comes with QLogic OFED+ is to use mpi-selector to change the selected MPI/compiler combination, see “Managing MVAPICH, and MVAPICH2 with the mpi-selector
Utility” on page 5-5.
These compilers can be invoked on the command line by passing options to the wrapper scripts. Command line options override environment variables, if set.
Tables 4-3 and 4-4 show the options for each of the compilers.
In each case, ..... stands for the remaining options to the mpicxx script, the options to the compiler in question, and the names of the files on which it operates.
Table 4-3. Intel
Compiler Command
C $ mpicc -cc=icc .....
C++ $ mpicc -CC=icpc
Fortran 77 $ mpif77 -fc=ifort .....
Fortran 90/95 $ mpif90 -f90=ifort .....
$ mpif95 -f95=ifort .....
Table 4-4. Portland Group (PGI)
Compiler Command
C mpicc -cc=pgcc .....
C++ mpicc -CC=pgCC
Fortran 77 mpif77 -fc=pgf77 .....
Fortran 90/95 mpif90 -f90=pgf90 .....
mpif95 -f95=pgf95 .....
Also, use mpif77,
mpif90, or mpif95 for linking; otherwise, .true. may have
the wrong value.
If you are not using the provided scripts for linking, link a sample program using the -show option as a test (without the actual build) to see what libraries to add to your link line. Some examples of using the PGI compilers follow.
For Fortran 90 programs:
$ mpif90 -f90=pgf90 -show pi3f90.f90 -o pi3f90
pgf90 -I/usr/include/mpich/pgi5/x86_64 -c -I/usr/include
pi3f90.f90 -c
pgf90 pi3f90.o -o pi3f90 -lmpichf90 -lmpich
-lmpichabiglue_pgi5
Fortran 95 programs will be similar to the above.
For C programs:
$ mpicc -cc=pgcc -show cpi.c
pgcc -c cpi.c
pgcc cpi.o -lmpich -lpgftnrtl -lmpichabiglue_pgi5
Compiler and Linker Variables
When you use environment variables (e.g., $MPICH_CC) to select the compiler mpicc (and others) will use, the scripts will also set the matching linker variable
(for example, $MPICH_CLINKER), if it is not already set. When both the environment variable and the command line option are used (for example, -cc=gcc), the command line option takes precedence.
When both the compiler and linker variables are set, and they do not match for the compiler you are using, the MPI program may fail to link; or, if it links, it may not execute correctly.

Process Allocation

Normally MPI jobs are run with each node program (process) being associated with a dedicated QLogic IB adapter hardware context that is mapped to a CPU.
If the number of node programs is greater than the available number of hardware contexts, software context sharing increases the number of node programs that can be run. Each adapter supports four software contexts per hardware context, so up to four node programs (from the same MPI job) can share that hardware context. There is a small additional overhead for each shared context.
Table 4-5 shows the maximum number of contexts available for each adapter.
Table 4-5. Available Hardware and Software Contexts
Adapter: QLE7342/QLE7340
Available hardware contexts (same as the number of supported CPUs): 16
Available contexts when software context sharing is enabled: 64
The default hardware context/CPU mappings can be changed on the QDR IB Adapters (QLE734x). See “IB Hardware Contexts on the QDR IB Adapters” on
page 4-8 for more details.
Context sharing is enabled by default. How the system behaves when context sharing is enabled or disabled is described in “Enabling and Disabling Software
Context Sharing” on page 4-9.
When running a job in a batch system environment where multiple jobs may be running simultaneously, it is useful to restrict the number of IB contexts that are made available on each node of an MPI job. See “Restricting IB Hardware Contexts in
a Batch Environment” on page 4-10.
Errors that may occur with context sharing are covered in “Context Sharing Error
Messages” on page 4-11.
There are multiple ways of specifying how processes are allocated. You can use the mpihosts file, the -np and -ppn options with mpirun, and the MPI_NPROCS and PSM_SHAREDCONTEXTS_MAX environment variables. How these are set is covered later in this document.
IB Hardware Contexts on the QDR IB Adapters
On the QLE7340 and QLE7342 QDR adapters, adapter receive resources are statically partitioned across the IB contexts according to the number of IB contexts enabled. The following defaults are automatically set according to the number of online CPUs in the node:
For four or fewer CPUs: 6 (4 + 2)
For five to eight CPUs: 10 (8 + 2)
For nine or more CPUs: 18 (16 + 2)
The additional contexts on the QDR adapters support the kernel, one for each port.
Performance can be improved in some cases by disabling IB hardware contexts when they are not required so that the resources can be partitioned more effectively.
To disable this behavior, explicitly configure for the number you want to use with the cfgctxts module parameter in the modprobe configuration file (see
“Affected Files” on page 3-37 for exact file name and location).
The maximum that can be set is 18 on QDR IB Adapters.
The driver must be restarted if this default is changed. See “Managing the ib_qib
Driver” on page 3-21.
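As an illustrative sketch (the value 16 is arbitrary; see the sections referenced above for the exact file name and location), a fixed number of contexts could be configured by adding a line such as the following to the modprobe configuration file and then restarting the driver:
options ib_qib cfgctxts=16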
In rare cases, setting contexts automatically on QDR IB Adapters can lead to sub-optimal performance where one or more IB hardware contexts have been disabled and a job is run that requires software context sharing. Since the algorithm ensures that there is at least one IB context per online CPU, this case occurs only if the CPUs are over-subscribed with processes (which is not normally recommended). In this case, it is best to override the default to use as many IB contexts as are available, which minimizes the amount of software context sharing required.
Enabling and Disabling Software Context Sharing
By default, context sharing is enabled; it can also be specifically disabled.
Context Sharing Enabled: The MPI library provides PSM the local process layout so that IB contexts available on each node can be shared if necessary; for example, when running more node programs than contexts. All PSM jobs assume that they can make use of all available IB contexts to satisfy the job requirement and try to give a context to each process.
When context sharing is enabled on a system with multiple QLogic IB adapter boards (units) and the IPATH_UNIT environment variable is set, the number of IB contexts made available to MPI jobs is restricted to the number of contexts available on that unit. When multiple IB devices are present, it restricts the use to a specific IB Adapter unit. By default, all configured units are used in round robin order.
Context Sharing Disabled: Each node program tries to obtain exclusive access to an IB hardware context. If no hardware contexts are available, the job aborts.
To explicitly disable context sharing, set this environment variable in one of the two following ways:
PSM_SHAREDCONTEXTS=0
PSM_SHAREDCONTEXTS=NO
The default value of PSM_SHAREDCONTEXTS is 1 (enabled).
Restricting IB Hardware Contexts in a Batch Environment
If required for resource sharing between multiple jobs in batch systems, you can restrict the number of IB hardware contexts that are made available on each node of an MPI job by setting that number in the PSM_SHAREDCONTEXTS_MAX or PSM_RANKS_PER_CONTEXT environment variables.
For example, if you are running two different jobs on nodes using a QDR IB HCA, set PSM_SHAREDCONTEXTS_MAX to 8 instead of the default 16. Each job would then have at most 8 of the 16 available hardware contexts. Both of the jobs that want to share a node would have to set PSM_SHAREDCONTEXTS_MAX=8.
MPIs use different methods for propagating environment variables to the nodes used for the job; See Section 7 for examples. Open MPI will automatically propagate PSM environment variables.
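For example (a sketch; the process count, hostfile name, and program name are placeholders), each of the two jobs sharing the nodes could be launched as follows, relying on Open MPI to propagate the PSM setting:
$ export PSM_SHAREDCONTEXTS_MAX=8
$ mpirun -np 16 -machinefile mpihosts mpi_app_name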
Setting PSM_SHAREDCONTEXTS_MAX=8 as a clusterwide default would unnecessarily penalize nodes that are dedicated to running single jobs. QLogic recommends using a per-node setting, or some level of coordination with the job scheduler, to set this environment variable.
The number of contexts can be explicitly configured with the cfgctxts module parameter. This will override the default settings based on the number of CPUs present on each node. See “IB Hardware Contexts on the QDR IB Adapters” on
page 4-8.
PSM_RANKS_PER_CONTEXT provides an alternate way of specifying how PSM should use contexts. The variable is the number of ranks that will share each hardware context. The supported values are 1, 2, 3 and 4, where 1 is no context sharing, 2 is 2-way context sharing, 3 is 3-way context sharing and 4 is the maximum 4-way context sharing. The same value of PSM_RANKS_PER_CONTEXT must be used for all ranks on a node, and typically, you would use the same value for all nodes in that job. Either PSM_RANKS_PER_CONTEXT or PSM_SHAREDCONTEXTS_MAX would be used in a particular job, but not both. If both are used and the settings are incompatible, then PSM will report an error and the job will fail to start up.
Context Sharing Error Messages
The error message when the context limit is exceeded is:
No free InfiniPath contexts available on /dev/ipath
This message appears when the application starts.
Error messages related to contexts may also be generated by ipath_checkout or mpirun. For example:
PSM found 0 available contexts on InfiniPath device
The most likely cause is that the cluster has processes using all the available PSM contexts. Clean up these processes before restarting the job.
Running in Shared Memory Mode
Open MPI supports running exclusively in shared memory mode; no QLogic adapter is required for this mode of operation. This mode is used for running applications on a single node rather than on a cluster of nodes.
To add pre-built applications (benchmarks), add /usr/mpi/gcc/openmpi-1.4.3-qlc/tests/osu_benchmarks-3.1.1 to your PATH (or if you installed the MPI in another location: add $MPI_HOME/tests/osu_benchmarks-3.1.1 to your PATH).
To enable shared memory mode, use a single node in the mpihosts file. For example, if the file were named onehost and it is in the working directory, the following would be entered:
$ cat /tmp/onehost
idev-64 slots=8
Enabling the shared memory mode as previously described uses a feature of Open MPI host files to list the number of slots, which is the number of possible MPI processes (also known as ranks) that you want to run on the node. Typically this is set equal to the number of processor cores on the node. A hostfile with 8 lines containing 'idev-64' would function identically. You can use this hostfile and run:
$ mpirun -np 2 -hostfile onehost osu_latency
to measure MPI latency between two cores on the same host using shared memory, or
$ mpirun -np 2 -hostfile onehost osu_bw
to measure MPI unidirectional bandwidth using shared memory.

mpihosts File Details

As noted in “Create the mpihosts File” on page 4-3, a hostfile (also called machines file, nodefile, or hostsfile) has been created in your current working directory. This file names the nodes on which the node programs may run.
The two supported formats for the hostfile are:
hostname1
hostname2
...
or
hostname1 slots=process_count
hostname2 slots=process_count
...
In the first format, if the -np count (number of processes to spawn in the mpirun command) is greater than the number of lines in the machine file, the hostnames will be repeated (in order) as many times as necessary for the requested number of node programs.
Also in the first format, if the -np count is less than the number of lines in the machine file, mpirun still processes the entire file and tries to pack processes to use as few hosts as possible in the hostfile. This is a different behavior than MVAPICH or the no-longer-supported QLogic MPI.
In the second format, process_count can be different for each host, and is normally the number of available processors on the node. When not specified, the default value is one. The value of process_count determines how many node programs will be started on that host before using the next entry in the hostfile file. When the full hostfile is processed, and there are additional processes requested, processing starts again at the start of the file.
It is generally recommended to use the second format and various command line options to schedule the placement of processes to nodes and cores. For example, the mpirun option -npernode can be used to specify (similar to the Intel MPI option -ppn) how many processes should be scheduled on each node on each pass through the hostfile. In the case of nodes with 8 cores each, if the hostfile line is specified as hostname1 slots=8 max-slots=8, then Open MPI will assign a maximum of 8 processes to the node and there can be no over-subscription of the 8 cores.
There are several alternative ways of specifying the hostfile:
The command line option -hostfile can be used as shown in the following command line:
$ mpirun -np n -hostfile mpihosts [other options] program-name
-machinefile is a synonym for -hostfile. In either case, if the named file cannot be opened, the MPI job fails.
An alternate mechanism to -hostfile for specifying hosts is the -H, -hosts, or --host option followed by a host list. The host list can follow one of the following examples:
host-01, or
host-01,host-02,host-04,host-06,host-07,host-08
In the absence of both the -hostfile option and the -H option, mpirun uses the file ./mpihosts, if it exists.
If you are working in the context of a batch queuing system, it may provide a job submission script that generates an appropriate mpihosts file. For more details about how to schedule processes to nodes with Open MPI, refer to the Open MPI web site:
http://www.open-mpi.org/faq/?category=running#mpirun-scheduling

Using Open MPI’s mpirun

The script mpirun is a front end program that starts a parallel MPI job on a set of nodes in an IB cluster. mpirun may be run on any x86_64 machine inside or outside the cluster, as long as it is on a supported Linux distribution and has TCP connectivity to all IB cluster machines to be used in a job.
The script starts, monitors, and terminates the node programs. mpirun uses ssh (secure shell) to log in to individual cluster machines and prints any messages that the node program prints on stdout or stderr on the terminal where mpirun is invoked.
The general syntax is:
$ mpirun [mpirun_options...] program-name [program options]
program-name is usually the pathname to the executable MPI program. When the MPI program resides in the current directory and the current directory is not in your search path, then program-name must begin with ‘./’, for example:
./program-name
Unless you want to run only one instance of the program, use the -np option, for example:
$ mpirun -np n [other options] program-name
This option spawns n instances of program-name. These instances are called node programs.
Generally, mpirun tries to distribute the specified number of processes evenly among the nodes listed in the hostfile. However, if the number of processes exceeds the number of nodes listed in the hostfile, then some nodes will be assigned more than one instance of the program.
Another command line option, -npernode, instructs mpirun to assign a fixed number p of node programs (processes) to each node, as it distributes n instances among the nodes:
$ mpirun -np n -npernode p -hostfile mpihosts program-name
This option overrides the slots=process_count specifications, if any, in the lines of the mpihosts file. As a general rule, mpirun distributes the n node programs among the nodes without exceeding, on any node, the maximum number of instances specified by the slots=process_count option. The value of the slots=process_count option is specified by either the -npernode command line option or in the mpihosts file.
Typically, the number of node programs should not be larger than the number of processor cores, at least not for compute-bound programs.
The -np option specifies the number of processes to spawn. If this option is not set, then the environment variable MPI_NPROCS is checked. If MPI_NPROCS is not set, the default is to determine the number of processes based on the number of hosts in the hostfile or the list of hosts given with -H or --host.
-npernode processes-per-node
This option creates up to the specified number of processes per node.
Each node program is started as a process on one node. While a node program may fork child processes, the children themselves must not call MPI functions.
There are many more mpirun options for scheduling where the processes get assigned to nodes. See man mpirun for details.
mpirun monitors the parallel MPI job, terminating when all the node programs in that job exit normally, or if any of them terminates abnormally.
Killing the mpirun program kills all the processes in the job. Use CTRL+C to kill mpirun.

Console I/O in Open MPI Programs

Open MPI directs UNIX standard input to /dev/null on all processes except the MPI_COMM_WORLD rank 0 process. The MPI_COMM_WORLD rank 0 process
inherits standard input from mpirun.
The node that invoked mpirun need not be the same as the node where the MPI_COMM_WORLD rank 0 process resides. Open MPI handles the redirection of mpirun's standard input to the rank 0 process.
Open MPI directs UNIX standard output and error from remote nodes to the node that invoked mpirun and prints it on the standard output/error of mpirun. Local processes inherit the standard output/error of mpirun and transfer to it directly.
It is possible to redirect standard I/O for Open MPI applications by using the typical shell redirection procedure on mpirun.
$ mpirun -np 2 my_app < my_input > my_output
Note that in this example only the MPI_COMM_WORLD rank 0 process will receive the stream from my_input on stdin. The stdin on all the other nodes will be tied to
/dev/null. However, the stdout from all nodes will be collected into the my_output file.

Environment for Node Programs

The following information can be found in the Open MPI man page and is repeated here for ease of use.
Remote Execution
Open MPI requires that the PATH environment variable be set to find executables on remote nodes (this is typically only necessary in rsh- or ssh-based environments -- batch/scheduled environments typically copy the current environment to the execution of remote jobs, so if the current environment has PATH and/or LD_LIBRARY_PATH set properly, the remote nodes will also have it set properly). If Open MPI was compiled with shared library support, it may also be necessary to have the LD_LIBRARY_PATH environment variable set on remote nodes as well (especially to find the shared libraries required to run user MPI applications).
It is not always desirable or possible to edit shell startup files to set PATH and/or LD_LIBRARY_PATH. The --prefix option is provided for some simple configurations where this is not possible.
The --prefix option takes a single argument: the base directory on the remote node where Open MPI is installed. Open MPI will use this directory to set the remote PATH and LD_LIBRARY_PATH before executing any Open MPI or user applications. This allows running Open MPI jobs without having pre-configured the PATH and LD_LIBRARY_PATH on the remote nodes.
Open MPI adds the base-name of the current node’s bindir (the directory where Open MPI’s executables are installed) to the prefix and uses that to set the PATH on the remote node. Similarly, Open MPI adds the base-name of the current node’s libdir (the directory where Open MPI’s libraries are installed) to the prefix and uses that to set the LD_LIBRARY_PATH on the remote node. For example:
Local bindir: /local/node/directory/bin
Local libdir: /local/node/directory/lib64
If the following command line is used:
% mpirun --prefix /remote/node/directory
Open MPI will add /remote/node/directory/bin to the PATH and /remote/node/directory/lib64 to the LD_LIBRARY_PATH on the remote
node before attempting to execute anything.
Note that --prefix can be set on a per-context basis, allowing for different values for different nodes.
The --prefix option is not sufficient if the installation paths on the remote node are different than the local node (for example, if /lib is used on the local node but /lib64 is used on the remote node), or if the installation paths are something other than a subdirectory under a common prefix.
Note that executing mpirun using an absolute pathname is equivalent to specifying --prefix without the last subdirectory in the absolute pathname to
mpirun. For example:
% /usr/local/bin/mpirun ... is equivalent to
% mpirun --prefix /usr/local
Exported Environment Variables
All environment variables that are named in the form OMPI_* will automatically be exported to new processes on the local and remote nodes. The -x option to mpirun can be used to export specific environment variables to the new processes. While the syntax of the -x option allows the definition of new variables, note that the parser for this option is currently not very sophisticated; it does not understand quoted values. Users are advised to set variables in the environment and use -x to export them, not to define them.
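For example (the variable name MY_APP_MODE and its value are placeholders), to export an existing variable and define an additional one for all node programs:
$ mpirun -np 4 -machinefile mpihosts -x LD_LIBRARY_PATH -x MY_APP_MODE=test mpi_app_name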
Setting MCA Parameters
The -mca switch allows the passing of parameters to various Modular Component Architecture (MCA) modules. MCA modules have direct impact on MPI programs because they allow tunable parameters to be set at run time (such as which BTL communication device driver to use, what parameters to pass to that BTL, and so on.).
The -mca switch takes two arguments: key and value. The key argument generally specifies which MCA module will receive the value. For example, the key btl is used to select which BTL to be used for transporting MPI messages. The value argument is the value that is passed. For example:
mpirun -mca btl tcp,self -np 1 foo
Tells Open MPI to use the tcp and self BTLs, and to run a single copy of foo on an allocated node.
mpirun -mca btl self -np 1 foo
Tells Open MPI to use the self BTL, and to run a single copy of foo on an allocated node.
The -mca switch can be used multiple times to specify different key and/or value arguments. If the same key is specified more than once, the values are concatenated with a comma (",") separating them.
Note that the -mca switch is simply a shortcut for setting environment variables. The same effect may be accomplished by setting corresponding environment variables before running mpirun. The form of the environment variables that Open MPI sets is:
OMPI_MCA_key=value
Thus, the -mca switch overrides any previously set environment variables. The
-mca settings similarly override MCA parameters set in these two files, which are
searched (in order):
1. $HOME/.openmpi/mca-params.conf: The user-supplied set of values takes the highest precedence.
2. $prefix/etc/openmpi-mca-params.conf: The system-supplied set of values has a lower precedence.
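For example (a sketch equivalent to the first -mca example above), the same BTL selection can be made through the OMPI_MCA_btl environment variable before invoking mpirun:
$ export OMPI_MCA_btl=tcp,self
$ mpirun -np 1 foo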

Environment Variables

Table 4-6 contains a summary of the environment variables that are relevant to any PSM, including Open MPI. Table 4-7 is more relevant for the MPI programmer or script writer, because these variables are only active after the mpirun command has been issued and while the MPI processes are active. Open MPI provides the environment variables shown in Table 4-7 that will be defined on every MPI process. Open MPI guarantees that these variables will remain stable throughout future releases.

Table 4-6. Environment Variables Relevant for any PSM

OMP_NUM_THREADS
This variable is used by a compiler’s OpenMP run-time library. Use this variable to adjust the split between MPI processes and OpenMP threads. Usually, the number of MPI processes (per node) times the number of OpenMP threads will be set to match the number of CPUs per node. An example would be a node with eight CPUs, running two MPI processes and four OpenMP threads. In this case, OMP_NUM_THREADS is set to 4. OMP_NUM_THREADS is on a per-node basis, so it needs to be propagated to each node used in the job, in a way that your MPI supports.

PSM_SHAREDCONTEXTS
This variable overrides automatic context sharing behavior. YES is equivalent to 1.
Default: 1

PSM_SHAREDCONTEXTS_MAX
This variable restricts the number of IB contexts that are made available on each node of an MPI job.
Default: Up to 16, set automatically based on the number of CPUs on the node

PSM_DEVICES
Set this variable to enable running in shared memory mode on a single node.
Default: self,ipath

IPATH_NO_CPUAFFINITY
When set to 1, the PSM library will skip trying to set processor affinity. This is also skipped if the processor affinity mask is set to a list smaller than the number of processors prior to MPI_Init() being called. Otherwise the initialization code sets CPU affinity in a way that optimizes CPU and memory locality and load.
Default: Unset

IPATH_PORT
Specifies the port to use for the job, 1 or 2. Specifying 0 will autoselect the port.
Default: Unset

IPATH_UNIT
This variable is for context sharing. When multiple IB devices are present, and the IPATH_UNIT environment variable is set, the number of IB contexts made available to MPI jobs will be restricted to the number of contexts available on that unit. By default, IPATH_UNIT is unset and contexts from all configured units are made available to MPI jobs in round robin order.
Default: Unset

IPATH_HCA_SELECTION_ALG
This variable provides user-level support to specify the HCA/port selection algorithm. The default option is a round robin that allocates MPI processes to the HCAs in an alternating or round robin fashion. The older mechanism option is packed, which fills all contexts on one HCA before allocating from the next HCA. For example, in the case of using two single-port HCAs, the default IPATH_HCA_SELECTION_ALG=Round Robin setting will allow 2 or more MPI processes per node to use both HCAs and to achieve performance improvements compared to what can be achieved with one HCA.
Default: Round Robin

IPATH_SL
Service Level for QDR Adapters; these are used to work with the switch’s vFabric feature.
Default: Unset

LD_LIBRARY_PATH
This variable specifies the path to the run-time library.
Default: Unset

Table 4-7. Environment Variables Relevant for Open MPI

OMPI_COMM_WORLD_SIZE
This environment variable selects the number of processes in this process’ MPI Comm_World.

OMPI_COMM_WORLD_RANK
This variable is used to select the MPI rank of this process.

OMPI_COMM_WORLD_LOCAL_RANK
This environment variable selects the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.

OMPI_UNIVERSE_SIZE
This environment variable selects the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.
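As a quick illustration (the hostfile name is a placeholder), the Open MPI variables in Table 4-7 can be inspected by running a shell command under mpirun:
$ mpirun -np 2 -machinefile mpihosts sh -c 'echo "rank $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE, local rank $OMPI_COMM_WORLD_LOCAL_RANK"'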

Job Blocking in Case of Temporary IB Link Failures

By default, as controlled by mpirun’s quiescence parameter -q, an MPI job is killed for quiescence in the event of an IB link failure (or unplugged cable). This quiescence timeout occurs under one of the following conditions:
A remote rank’s process cannot reply to out-of-band process checks.
MPI is inactive on the IB link for more than 15 minutes.
To keep remote process checks but disable triggering quiescence for temporary IB link failures, use the -disable-mpi-progress-check option with a nonzero -q option. To disable quiescence triggering altogether, use -q 0. No matter how these options are used, link failures (temporary or other) are always logged to syslog.
If the link is down when the job starts and you want the job to continue blocking until the link comes up, use the -t -1 option.

Open MPI and Hybrid MPI/OpenMP Applications

Open MPI supports hybrid MPI/OpenMP applications, provided that MPI routines are called only by the master OpenMP thread. This is referred to as the funneled thread model. Instead of MPI_Init/MPI_INIT (for C/C++ and Fortran respectively), the program can call MPI_Init_thread/MPI_INIT_THREAD to determine the level of thread support, and the value MPI_THREAD_FUNNELED will be returned.
To use this feature, the application must be compiled with both OpenMP and MPI code enabled. To do this, use the -openmp or -mp flag (depending on your compiler) on the mpicc compile line.
As mentioned previously, MPI routines can be called only by the master OpenMP thread. The hybrid executable is executed as usual using mpirun, but typically only one MPI process is run per node and the OpenMP library will create additional threads to utilize all CPUs on that node. If there are sufficient CPUs on a node, you may want to run multiple MPI processes and multiple OpenMP threads per node.
The number of OpenMP threads is typically controlled by setting the OMP_NUM_THREADS environment variable in the .bashrc file. (OMP_NUM_THREADS is used by other compilers’ OpenMP products, but is not an Open MPI environment variable.) Use this variable to adjust the split between MPI processes and OpenMP threads. Usually, the number of MPI processes (per node) times the number of OpenMP threads will be set to match the number of CPUs per node. An example case would be a node with four CPUs, running one MPI process and four OpenMP threads. In this case, OMP_NUM_THREADS is set to four. OMP_NUM_THREADS is on a per-node basis.
See “Environment for Node Programs” on page 4-15 for information on setting environment variables.
NOTE: With Open MPI, and other PSM-enabled MPIs, you will typically want to turn off PSM's CPU affinity controls so that the OpenMP threads spawned by an MPI process are not constrained to stay on the CPU core of that process, causing over-subscription of that CPU. Accomplish this using the IPATH_NO_CPUAFFINITY=1 setting, as follows:
OMP_NUM_THREADS=8 (typically set in the ~/.bashrc file)
$ mpirun -np 2 -H host1,host2 -x IPATH_NO_CPUAFFINITY=1 ./hybrid_app
In this case, there would typically be 8 or more CPU cores on the host1 and host2 nodes, and this job would run on a total of 16 threads, 8 on each node. You can use 'top' and then '1' to monitor that the load is distributed to 8 different CPU cores in this case.
Both OMP_NUM_THREADS and IPATH_NO_CPUAFFINITY can be set in the .bashrc file or both can be set on the command line after -x options.
When there are more threads than CPUs, both MPI and OpenMP performance can be significantly degraded due to over-subscription of the CPUs.

Debugging MPI Programs

Debugging parallel programs is substantially more difficult than debugging serial programs. Thoroughly debugging the serial parts of your code before parallelizing is good programming practice.

MPI Errors

Almost all MPI routines (except MPI_Wtime and MPI_Wtick) return an error code; either as the function return value in C functions or as the last argument in a Fortran subroutine call. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. Therefore, you can get information about MPI exceptions in your code by providing your own handler for MPI_ERRORS_RETURN. See the man page for MPI_Errhandler_set for details.
See the standard MPI documentation referenced in Appendix H for details on the MPI error codes.

Using Debuggers

See http://www.open-mpi.org/faq/?category=debugging for details on
debugging with Open MPI.
NOTE: The TotalView® debugger can be used with the Open MPI supplied in this release. Consult the TotalView documentation for more information:
http://www.open-mpi.org/faq/?category=running#run-with-tv

5 Using Other MPIs

This section provides information on using other MPI implementations. Detailed information on using Open MPI is provided in Section 4; Open MPI is covered in this section only in the context of choosing among multiple MPIs, or in tables that compare the available MPIs.

Introduction

Support for multiple high-performance MPI implementations has been added. Most implementations run over both PSM and OpenFabrics Verbs (see
Table 5-1). To choose which MPI to use, use the mpi-selector-menu
command, as described in “Managing MVAPICH, and MVAPICH2 with the
mpi-selector Utility” on page 5-5.
Table 5-1. Other Supported MPI Implementations

Open MPI 1.4.3
Runs over: PSM, Verbs
Compiled with: GCC, Intel, PGI
Comments: Provides some MPI-2 functionality (one-sided operations and dynamic processes). Available as part of the QLogic download. Can be managed by mpi-selector.

MVAPICH version 1.2
Runs over: PSM, Verbs
Compiled with: GCC, Intel, PGI
Comments: Provides MPI-1 functionality. Available as part of the QLogic download. Can be managed by mpi-selector.

MVAPICH2 version 1.7
Runs over: PSM, Verbs
Compiled with: GCC, Intel, PGI
Comments: Provides MPI-2 functionality. Can be managed by mpi-selector.

Platform MPI 8
Runs over: PSM, Verbs
Compiled with: GCC (default)
Comments: Provides some MPI-2 functionality (one-sided operations). Available for purchase from Platform Computing (an IBM Company).

Intel MPI version 4.0
Runs over: TMI/PSM, uDAPL
Compiled with: GCC (default)
Comments: Provides MPI-1 and MPI-2 functionality. Available for purchase from Intel.

Table Notes: MVAPICH and Open MPI have been compiled for PSM to support the following versions of the compilers: (GNU) gcc 4.1.0, (PGI) pgcc 9.0, and (Intel) icc 11.1.
These MPI implementations run on multiple interconnects, and have their own mechanisms for selecting the interconnect that they run on. Basic information about using these MPIs is provided in this section. However, for more detailed information, see the documentation provided with the version of MPI that you want to use.

Installed Layout

By default, the MVAPICH, MVAPICH2, and Open MPI are installed in the following directory tree:
/usr/mpi/$compiler/$mpi-mpi_version
The QLogic-supplied MPIs precompiled with the GCC, PGI, and the Intel compilers will also have -qlc appended after the MPI version number.
For example:
/usr/mpi/gcc/openmpi-VERSION-qlc
If a prefixed installation location is used, /usr is replaced by $prefix.
The following examples assume that the default path for each MPI implementation to mpirun is:
/usr/mpi/$compiler/$mpi/bin/mpirun
Again, /usr may be replaced by $prefix. This path is sometimes referred to as $mpi_home/bin/mpirun in the following sections.
See the documentation for Intel MPI, and Platform MPI for their default installation directories.

Open MPI

Open MPI is an open source MPI-2 implementation from the Open MPI Project. Pre-compiled versions of Open MPI version 1.4.3 that run over PSM and are built with the GCC, PGI, and Intel compilers are available with the QLogic download.
Details on Open MPI operation are provided in Section 4.

MVAPICH

Pre-compiled versions of MVAPICH 1.2 built with the GNU, PGI, and Intel compilers, and that run over PSM, are available with the QLogic download.
MVAPICH that runs over Verbs and is pre-compiled with the GNU compiler is also available.
MVAPICH can be managed with the mpi-selector utility, as described in
“Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility” on page 5-5.

Compiling MVAPICH Applications

As with Open MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-2).
Table 5-2. MVAPICH Wrapper Scripts
Wrapper Script Name Language
mpicc C
mpiCC, mpicxx C++
mpif77 Fortran 77
mpif90 Fortran 90
To compile your program in C, type:
$ mpicc mpi_app_name.c -o mpi_app_name
To check the default configuration for the installation, check the following file:
/usr/mpi/$compiler/$mpi/etc/mvapich.conf

Running MVAPICH Applications

By default, the MVAPICH shipped with QLogic OFED+ and the QLogic InfiniBand Fabric Suite (IFS) runs over PSM once it is installed.
Here is an example of a simple mpirun command running with four processes:
$ mpirun -np 4 -hostfile mpihosts mpi_app_name
Password-less ssh is used unless the -rsh option is added to the command line above.

Further Information on MVAPICH

For more information about MVAPICH, see:
http://mvapich.cse.ohio-state.edu/

MVAPICH2

Pre-compiled versions of MVAPICH2 1.7 built with the GNU, PGI, and Intel compilers, and that run over PSM, are available with the QLogic download.
MVAPICH2 that runs over Verbs and is pre-compiled with the GNU compiler is also available.
MVAPICH2 can be managed with the mpi-selector utility, as described in
“Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility” on page 5-5.

Compiling MVAPICH2 Applications

As with Open MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-3).
Table 5-3. MVAPICH2 Wrapper Scripts
Wrapper Script Name Language
mpicc C
mpiCC, mpicxx C++
mpif77 Fortran 77
mpif90 Fortran 90
To compile your program in C, type:
$ mpicc mpi_app_name.c -o mpi_app_name
To check the default configuration for the installation, check the following file:
/usr/mpi/$compiler/$mpi/etc/mvapich.conf